Episode 60:

Architecting a Boring Stream Processing Tool With Ashley Jeffs of Benthos

November 3, 2021

This week on The Data Stack Show, Eric and Kostas talk with Ashley Jeffs, the creator and maintainer of an open source project for data stream processing called Benthos. Ashley describes the journey of Benthos from starting out as just a personal “weekend warrior” project of his to becoming a full-fledged project that he’s maintained for five years.

Notes:

Highlights from this week’s conversation include:

A brief overview of Ashley’s background (2:47)
Benthos’ creation and the problems it was meant to address (4:01)
Use cases for Benthos (18:25)
Key features of Benthos that make it stand out (22:23)
Adding windowing to Benthos for fun (29:23)
The highs and lows of maintaining an open source project for five years (32:17)
The architecture of Benthos (36:23)
The importance of ordering in streaming processing (42:15)
Gaining traction with an open source project (53:21)
Benthos’ blobfish mascot (58:03)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 0:06
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.

Welcome back to the show. Today we’re going to talk with Ashley Jeffs, and he is the creator and maintainer of an open source project called Benthos. It’s a stream processing service, it’s a really, really cool tool, and in many ways it has a lot of alignment with what Kostas and I work on in our day jobs. It’s written in Go and does a bunch of interesting things. I actually have some technical questions, because after reading through the documentation, he’s made some decisions that are fascinating to me, maybe because of my lack of knowledge. But I think I’m most interested to know what it’s been like to maintain an open source project for five years, especially dealing with something that’s pretty complex. It’s not a JavaScript plug-in, not that those are immaterial, but when you talk about stream processing, and integrating with services like Kafka at very large companies, you’re dealing with some pretty heavy duty technology. So I’m sure that the emotional roller coaster of doing that for a long time has been interesting. And many times we don’t get to see that. So hopefully, Ashley will share a little bit about that with us. But what’s on your mind?

Kostas Pardalis 1:42
Yeah, two things. Two topics. Actually one, of course, like we have plenty of technical stuff to discuss about how a streaming processing system is not the easiest thing to engineer. So there are many trade-offs and many decisions that you have to make there. So yeah, I’m really looking forward to discussing the technical side of things. And then of course, I’d love to hear about his experience of being like a maintainer of an open source project for like five years. And from what I understand he’s the main and like, more than 98% of like the contributions come from him. So he’s very engaged with that. So it’s going to be super interesting to hear from him, like how he does this, and how he keeps himself motivated. And all those things.

Eric Dodds 2:30
Well, let’s dive in and talk with Ashley.

Kostas Pardalis 2:32
Yeah, let’s do it.

Eric Dodds 2:34
Ashley, welcome to The Data Stack Show. There’s no way we’re gonna have enough time to cover all the topics. So let’s just dive right in. Give us just a brief background on you, a brief overview of your career, and then what you do day to day.

Ashley Jeffs 2:46
Hello everyone, my name is Ash, thanks for having me on the show. So I’m the core maintainer of a project called Benthos, which I’ve been doing for about five years. It’s a data streaming service. It’s declarative. And the idea is this operationally simple thing. I started working on that around five years ago, after working in a sort of stream processing industry, which I’ve been doing for about eight years. I didn’t call myself a data engineer because the term didn’t really exist. But obviously, that’s pretty much what I consider myself to have been that whole time. And now that’s pretty much my job, just working on this project, kind of indirectly. But yeah, my job basically.

Eric Dodds 3:27
Okay, well, I want to hear more about that. One interesting side note, I don’t know if you’ve looked at the Google search trends for the term data engineer. But it’s crazy. It’s like a hockey stick over the last five years, which is really interesting. You can see like, okay, this is kind of people, we’re trying to figure out what to call this discipline. And then, of course, it’s like, formalized now. Well, tell us about Benthos. You started working on it five years ago, it’s a really cool tool. Tell us the details on what it is, what it does, and then especially why you ended up creating it.

Ashley Jeffs 4:01
So I kind of built it defensively. It’s got two main focuses as a project. If you look at it on the website and have a quick five second glance, it’s basically YAML programming, a stream processor. And the idea is that it’s operationally simple. What I mean by that is, the whole premise of this project is that it’s super correct in every possible way in terms of data retention and backpressure, and it tries to be the least headachy item in your streaming platform. Architecturally, that’s quite difficult, and it’s been a main focus of the project pretty much since day one. And that’s because I was working in a position where we were basically inventing the same product over and over again. We had this entire platform of services that read from something, do something to it, usually a single message transform, maybe some enrichments, hitting third party APIs, that kind of stuff, and then write it out somewhere. And we were plagued with development effort put into migrating services because they were all slightly different, with these weird combinations of different activities that each one was responsible for. We were just constantly rewriting these things to slightly change their behavior, recompile it and redeploy it, go through all the testing hassle, that kind of stuff. So I was in a position where I was kind of desperate for something to just be dynamic, in that you can drive it through configuration, so declarative, because these are usually just simple tasks. It’s filtering, transformations, some enrichments, and a few little extra bits in between, maybe some custom logic that you can plug in. But for the most part, it was just stuff that you could describe in a couple lines of config. We just didn’t have that tool. So I went on this weekend warrior effort to build what I would consider to be a solution to that problem. Our perspective at the time was the data was super important.
It’s basically our product. So delivery guarantees were very, very strict. And also, we were using Kafka all over the place. This was about eight years ago, so Kafka was, I think, like version 0.7 at that point. We were early adopting it and slowly migrating through this platform. And my take on it was, if this thing is a disk persisted, replicated service that we’re putting all this effort into operationally running, why would I have a service that has a disk buffer that is also operationally complex? If you get disk corruption or some sort of failure, then it’s a single point of failure that could introduce data loss in your system, potentially forcing you to do things like run backfills. So why don’t we just have a service that doesn’t need anything like that? It’s always going to respect the at-least-once delivery guarantees without any need for extra state. It’s just going to do that based on what offsets it’s committing, basically what you’d call a transaction, and what is effectively the Kafka Streams API. So what it’s supposed to be doing is making sure that you never commit an offset for a message that you haven’t effectively dealt with. The message has been passed on forwards, so you don’t need a disk buffer to have that delivery guarantee. And then the other piece of that puzzle was making it simple to use. So the idea is that you can slap a config together to create a pipeline. So this service is reading from Kafka, performing some sort of filtering, and then maybe applying some sort of masking, data scrubbing, whatever, then writing out to NATS, or ZeroMQ or something. I can then take that config and commit it into a repo. And if somebody comes up to me and says, oh my god, no, we need to stop writing to NATS, we’re gonna change this to RabbitMQ now, or this filter needs to change, we need to change the logic for that, I can just say, here’s the config, change that, it’s two lines, I can review it and then go.
And to me, that was, that was my way of ensuring that I would get to work on more fun things, like the actual stuff that I wanted to be doing at my day job. And then obviously, that naturally progressed to me only working on the boring stuff, because now I’m the maintainer of the service that’s doing the boring stuff. And that’s where I am now.
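The delivery guarantee Ashley describes (never commit an offset until the message has been passed on) can be sketched in a few lines of Go. This is an illustration only, not Benthos code: `Source`, `Sink`, `Run`, and `flakySink` are hypothetical names, and an in-memory slice stands in for a Kafka partition.

```go
package main

import (
	"errors"
	"fmt"
)

// Message is a hypothetical payload tagged with its position in the source.
type Message struct {
	Offset int
	Body   string
}

// Source is an in-memory stand-in for a Kafka partition (an assumption for
// this sketch). committed is the next offset to read after a restart.
type Source struct {
	msgs      []Message
	committed int
}

// Sink is a hypothetical output; Write returns an error when delivery fails.
type Sink interface {
	Write(m Message) error
}

// Run forwards messages starting at the committed offset. The offset only
// advances after the sink confirms the write, so a failed write leads to a
// re-read (and possibly a duplicate) rather than a lost message.
func Run(src *Source, sink Sink) error {
	for src.committed < len(src.msgs) {
		m := src.msgs[src.committed]
		if err := sink.Write(m); err != nil {
			return fmt.Errorf("offset %d not committed: %w", m.Offset, err)
		}
		src.committed = m.Offset + 1
	}
	return nil
}

// flakySink fails once on a chosen offset, then recovers.
type flakySink struct {
	failAt    int
	failed    bool
	delivered []string
}

func (s *flakySink) Write(m Message) error {
	if m.Offset == s.failAt && !s.failed {
		s.failed = true
		return errors.New("sink unavailable")
	}
	s.delivered = append(s.delivered, m.Body)
	return nil
}

func main() {
	src := &Source{msgs: []Message{{0, "a"}, {1, "b"}, {2, "c"}}}
	sink := &flakySink{failAt: 1}

	fmt.Println("first run:", Run(src, sink)) // stops at the failed write
	fmt.Println("committed:", src.committed)
	fmt.Println("after restart:", Run(src, sink)) // resumes from committed offset
	fmt.Println("delivered:", sink.delivered)
}
```

Because no state lives outside the source's committed offset, restarting the process is enough to recover, which is the property that makes a separate disk buffer unnecessary in this design.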

Eric Dodds 8:33
An attempt to journey into the exciting that ends with a continuation …

Ashley Jeffs 8:38
… inevitably going down the rabbit hole of boredom.

Eric Dodds 8:41
Yeah. Well, you know, I mean, one thing, actually, that one thing we’ve talked about on the show, actually, with several guests is that, and I loved as I was digging into your documentation, you say, in multiple places, the defining features that this is boring. And we’ve had multiple guests who have built really large scale systems, and will ask them about it and they’ll say, it’s kind of boring, but it works really well. And it’s extremely reliable. And so that really resonates with me, because that’s something that we’ve heard a lot. But one question for you. And I know we’re gonna dig into this a little bit later. But there was certainly a point at which you made a decision for the project to be open source. I mean, sometimes when you’re building this, especially to solve a problem that you’re dealing with inside of a company, it can sort of be the IP that exists inside of that company. What motivated you to decide to go open source with the project?

Ashley Jeffs 9:39
Yeah, so everything that I did in my spare time was like a learning exercise, and I would always make it open source. That was just a habit of mine because, the thing is, if I was planning to make something open source and knew that it was going to get attention, I wouldn’t do it, because I was so shy or so nervous about somebody actually looking at my code. But because I was so cynical about it, nobody’s ever going to look at this, nobody’s ever going to know I put this on GitHub, I would put all my little hobby projects on GitHub. So the idea of making it open source from the onset was obviously like a nervous exercise for me. But it’s also, you know, the excitement of maybe this is going to help somebody. But the main reason, and it’s why I mentioned that I kind of built this thing defensively for the company I was working at, is that there was just so much going on at the time. So this was a company called DataSift. And they were basically signing these firehoses of social media data, the biggest one being Twitter, and then putting filtering logic on top of that, so lots and lots of stream processing. Back when everybody else was talking about Hadoop as being big data, we were basically processing the Twitter firehose, constantly, for hundreds of gigabytes of customer filtering data. And it was this huge platform with all this stuff going on. And we were having to work pretty defensively to keep this thing going. Because our requirements were changing quite frequently, because I don’t know if you’ve realized this, but working with social media companies as a partner can sometimes be a little bit turbulent. And they can do things like cut you off randomly and force you to pivot.

Eric Dodds 11:16
Or change data without …

Ashley Jeffs 11:19
Or we just don’t want to work with you anymore. Bye! Your business is kaput. Sorry, oh, that’s awkward.

So yeah, we were constantly having to churn out what the platform is capable of. And the teams were amazing, like the engineering staff at DataSift was fantastic. But it’s still this huge effort. And you’ve got all this technical debt, because you’re constantly having to change all the services and what they can do, and all the capabilities and stuff. So there wasn’t any capacity really to work on something like this on company time. And in all honesty, I was working on it in my spare time for two years before it was really viable. Because at the end of the day, if you’ve got bespoke services that are built to do a specific task, replacing that with something generic is just going to be a challenge. To build all the basic stuff needed to have a dynamic system, and then to get it to perform in terms of stability, throughput, latency, that kind of stuff, is this massive effort. I didn’t know that when I started, otherwise I wouldn’t have started it. But then it just naturally progressed. It was a hobby project that nobody was really interested in. And then two years later, I come back to the company like, hey, this might be usable now. Can we use this, please? And it had already kind of got a bit of a life on GitHub at that point. So it just kind of carried on that way.

Eric Dodds 12:44
Sure. And did the company end up using it?

Ashley Jeffs 12:47
Oh, yeah. So they used it a fair amount in a few places where it was an immediate solution to a problem we had. We didn’t just, like, nuke all the other services on the platform. It was a very careful effort of, we will slowly roll this out in places where we were going to have to do some changes anyway. And then what happened is the company got bought and it was awesome, because we have this streaming platform and the idea was we were going to sort of use that technology throughout. They’re a very data heavy organization, and they have a load of different teams all working on their products. Yeah, the engineering teams there are fantastic. And the thing is, they’re geographically distributed. So they all do things slightly differently. They’ve got slightly different best practices of how they work with their data, or they did at the time, they’re probably more consistent now. But yes, so I had an opportunity then to go to all these different teams and say, hey, you’re looking to interact with our streaming infrastructure, here’s a tool for that. Rather than being blocked on us as a team enrolling you on this, and then getting you onboarded with all these infrastructure changes and things, why don’t you just run this thing yourself, and you can do it in your own time, and we’re not even in the loop. This service will allow you to interact with all of our stuff, hit these enrichment services, all those things. And it took off. Again, it took a bit of time, because you come to people with this generic service. And I think because it’s open source, and it’s a generic config-driven service, people immediately start thinking, is this gonna be like Logstash? Is it gonna take two minutes to start up? Am I going to rip my face off over the config format, that kind of thing. So people are quite skeptical. So it takes a while to kind of demonstrate to people that you’re going to get value out of this, you’re going to like it.
And I kind of became like an internal evangelist for using the service for this thing, this thing, this thing, and when people have use cases, I immediately jumped on it. Because that’s the bread and butter of the project. It can’t continue if I’m not constantly seeing new use cases and new problems to solve. So I kind of tried to nibble on as many use cases as I could.

Eric Dodds 15:07
Do you think that part of that also, I mean, you have an interesting perspective and that you got to have a very practical experience with streaming almost coming of age, right? Because back when you were using Kafka, the idea of streaming, as you’re talking about, is actually still pretty novel. Right? In terms of technology. So do you think also, to some extent, the adoption of streaming technologies is a little bit hard, like evangelizing use cases, in part just because streaming was still younger?

Ashley Jeffs 15:47
To an extent. Yeah, so it was kind of weird because I started working with some teams, and basically then got forced to work in a batch mode. Because there were use cases where it was like, we’ve got an S3 bucket, and we just want to consume the entire bucket, and then write it to Kafka. Because all the other teams are using Kafka. So it was one of these situations where I didn’t really think about it at the time, like, oh, they’re using batch, this isn’t a batch product, I can’t do that. It was more just a technical problem that’s pretty easy to solve. Basically, it’s an input, just like any other streaming input; once you’ve finished and the bucket is exhausted, you shut down gracefully. It’s not massively complicated. So there was an aspect of, you have to do streaming at this company, because that’s the data bus. That’s the data infrastructure of this company. We cannot do all we want to do in a batch way, the volumes are just too big. So this is how we’re gonna solve that problem. And I mean, nobody at that company that I interacted with was particularly intimidated by streaming. They were all excited to, you know, play around with this new tech.

And then the thing is, after that, the project kind of grew externally as more organizations started adopting it. I have never been in a position where I’ve had to convince anybody to adopt streaming, because they’re just coming to me. They’ve already got this infrastructure, and they’re looking for something to solve particular problems they have, and they’re stumbling upon it. So if somebody asked me, how do you convince a company to adopt streams? I’ve got no idea. I have absolutely no clue how to do that.

Eric Dodds 17:26
That’s a really helpful perspective. And I think it’s especially true in the context of social media data, and I think some of the other components of things that Meltwater provides as sort of data products. I would guess, actually, now that I think about it, the demand for streaming was probably unbelievable, because when you’re dealing with that nature of data, social media platforms, streaming in real time and getting updates as soon as you can to see what’s trending is probably super important. Well, I have been monopolizing this. I have a million more questions, but Kostas, we talked about some really interesting technological questions. So jump in.

Kostas Pardalis 18:03
Yeah, it’s been a very interesting conversation. All right, let’s try to dive a bit deeper into the technical side of things. And my first question is, can you give us a typical setup, including Benthos, with how it fits with, like, let’s say, the pretty much standard data stacks that we see out there? How do you see it deployed?

Ashley Jeffs 18:25
It’s often used as a plumbing tool. So imagine you’ve got, you’ve got Kafka infrastructures, it’s very often Kafka that people are using it with. There’s also MQTT. It seems to be a growing use case. But it’s normally a company that’s already doing some stream work. And what they’ve got is some services. They’ve either got other queue systems that other teams are using. So we want to share data with some team from another company that could be just another team at their company. And they just do things differently. They’ve got a different schema, they’ve got a different stream technology, whatever. And they just want some simple tool that they can just deploy, they don’t want to invest too much time into this partnership, maybe it’s a temporary one, maybe it’s going to change over time. So they just want something now that’s going to solve that problem. They don’t have to think about it, it’s automatically going to have metrics and logging and that kind of stuff. It’s low, low effort, basically. And then what tends to happen is you start using it that way, sort of defensively, and then you realize, Oh, hang on a minute, we’ve got this other service that’s just reading a topic and then doing some HTTP enrichment. Or maybe it’s calling some Python script or something. And all it’s doing is taking a payload, modifying it slightly, and then sending it on somewhere else. We could just do that with this Benthos instance. So why don’t we do that? And then it just kind of slowly grows from that point where you delete a project that you had to maintain and you’ve replaced it now with a couple lines of config. And it all fits in this one service, that’s kind of neat; you can deploy as many of them as you want, because it’s stateless. So it’s just, it’s just low effort. So it tends to begin with just a silly plumbing mechanism from one thing to another, maybe it’s a bit of filtering or something that somebody wants. And then they slowly grow. 
And maybe eventually people branch them out into different deployments with different configs and stuff, but they’ll be doing things, I tend to call it plumbing, I don’t really know if we’ve got like a good term for it in data engineering, but it’s not, it’s not a clever task normally. It’s usually single message transforms, and integrating with different services. So you might be hitting Redis Cache or something, to get some hydration data based on something like a document ID or something. Or maybe you’re hitting a language detection service on some of the content of a message, that kind of thing. And then enriching the data with that, that kind of stuff. But it’s stuff that can sometimes be considered to be quite complex problems. And the reality is, it’s not; it’s just an integration problem. And you can put that in a nice config. And then when things change, when somebody says, hey, our service is going to change, we no longer support that field or that thing, or is the new schema, then you just do a quick change, commit that, and it’s simple to test.
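The "plumbing" Ashley describes (filter a message, then enrich it from a cache keyed by document ID) amounts to a chain of single message transforms. The Go sketch below is a simplified illustration under stated assumptions: `Doc`, `Processor`, `DropEmpty`, `Enrich`, and `Pipeline` are hypothetical names, and a plain map stands in for a cache such as Redis. In Benthos itself this chain would be expressed in config rather than code.

```go
package main

import (
	"fmt"
	"strings"
)

// Doc is a hypothetical message payload.
type Doc struct {
	ID     string
	Text   string
	Fields map[string]string
}

// Processor handles one message at a time. Returning false drops the message.
type Processor func(d *Doc) bool

// DropEmpty filters out messages whose text is empty or whitespace.
func DropEmpty(d *Doc) bool { return strings.TrimSpace(d.Text) != "" }

// Enrich returns a processor that hydrates a message from a lookup table
// keyed by document ID. The map is a stand-in for an external cache.
func Enrich(cache map[string]string) Processor {
	return func(d *Doc) bool {
		if d.Fields == nil {
			d.Fields = map[string]string{}
		}
		if author, ok := cache[d.ID]; ok {
			d.Fields["author"] = author
		}
		return true
	}
}

// Pipeline applies each processor in order and stops at the first drop.
func Pipeline(d *Doc, steps ...Processor) bool {
	for _, step := range steps {
		if !step(d) {
			return false
		}
	}
	return true
}

func main() {
	cache := map[string]string{"doc-1": "ash"}
	docs := []*Doc{
		{ID: "doc-1", Text: "hello"},
		{ID: "doc-2", Text: "   "},
	}
	for _, d := range docs {
		kept := Pipeline(d, DropEmpty, Enrich(cache))
		fmt.Println(d.ID, "kept:", kept, "fields:", d.Fields)
	}
}
```

Each step is stateless, so swapping the filter or pointing the enrichment at a different service changes one small piece rather than the whole service.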

Kostas Pardalis 21:20
That’s really interesting. So you said that it’s very common to see it working together with Kafka, right? So Kafka, okay, there’s a whole ecosystem of tools around it, right? How is it used together with stuff like Kafka Connect, for example? Its purpose is, in a way, to connect Kafka with other services or with other streams. Then you can technically deploy some, let’s say, processing logic on top of Kafka, so you can process the data that sits on Kafka. It sounds like you can do everything inside Kafka in the end, right? Or at least that’s what Confluent wants to happen. So why do you think that someone who already has invested in the Kafka infrastructure would also use Benthos? I understand, after you start using it, why you keep using it and increase the use cases that you cover. But what’s the first thing that will convince someone to start using Benthos? Does it make sense what I’m asking?

Ashley Jeffs 22:22
Yeah, I think so. So I think the main selling point is for somebody who’s got JVM components, and maybe they’ve got Kafka, maybe they’re using Apache Camel or something, and then they’ve got some other logic on top. I think what tends to happen when people pick Benthos, I mean, it’s kind of difficult to summarize, because I don’t get an awful lot of feedback from the community often, but it’s normally an engineer that’s making the decision, like a data engineer in this context. And I think their main frustration is, they don’t like building stuff. They don’t like having a build system for these transformations they’ve got, especially if it’s really simple stuff, and especially so if they have to change it often. And they don’t like the weight of some of these components. They’re a little bit clunky, they’re a little bit awkward to use; they want something that is more friendly to an ops person. So if you’re on call, and you’re waking up at 3 a.m., and something has happened, maybe your servers crashed or something, and it’s part of the infrastructure, you can see in your graphs that you’ve had some sort of outage. There are the horror stories of some of these components, waking up and thinking, how can I recover all of these different things to solve the problem? Then they see this product, and it’s just a single static binary that doesn’t have any state; you can restart it on a whim. In fact, you can restart it constantly if you want, there’ll be no data loss. When they see an outage, it’s a simpler problem, because you don’t have to coordinate a backfill and coordinate all these components slowly coming on over time. They’ve probably already restarted if your infrastructure is set up for that. And you can just check on the graphs, the metrics and things, that it’s worked. If there’s a problem, if you’ve deployed something that is broken, then it’s just a config change.
So anybody can look at that and get some idea as to what’s going on. And they’re not reading code, they’re not looking at something, they’ve got committed to some CI system. And it was a full build that got deployed, they’re just looking at a config change that got deployed. So maybe there’s some mapping or something. And they can just roll that back. If it looks wrong, that kind of thing. I think it tends to be engineers that are making the decision. And obviously a lot of Go developers. I didn’t mention that it’s written in Go. So a lot of people are already writing Go services, it’s a natural win for them because they can write their plugins in Go rather than Java. But in terms of the feature set, it’s a lot of overlap with a lot of products that already exist in the Java ecosystem and more popular, they’re more widespread. So I’ve never gone after people making those deployments. I would never tell somebody if you’ve got a happy system that you’re using, and it’s using all these products, I would never tell them, you should ditch all that and use this thing. And if you’ve got a bespoke service that you’re happy with, and it’s doing all the stuff, and it’s your code, and you’re building it, keep it. If you’re happy with it, and it’s solving the problem, then you should definitely keep that thing rather than replacing it with this weird thing you’ve never seen before. It’s this trade off between deciding what you want to work on and what are your priorities as a team.

Kostas Pardalis 25:31
The declarative side of things is also quite important. I think it fits much more naturally in the workflows engineers have. You mentioned quite a few times, you can write a config, and I can review it, right? This thing of, I can review it, and then we can move fast, and we deploy things, and we can change things fast, this has a crazy value when you’re talking about an environment that needs to be alive all the time. At the same time, you have to create new logical units, because things are changing constantly and all that stuff. And I think that’s also a very interesting part of data engineering as an engineering discipline, because it’s this kind of crossover between software engineering and Ops. You have all these different facets that you have to handle at the same time, and you really have to pick the best from each one and try to create tools that combine the best practices from there. So I think that having this declarative way of describing what should happen there has amazing value. I can understand that, especially having worked with a JVM based infrastructure. So how would you compare Benthos to other stream processing platforms? What are the differences and the similarities?

Ashley Jeffs 26:46
So Benthos is much more focused on single message transforms. So you get a single payload and you’re doing something. You might have a batch, you can do batch processing, say, like, a consumer window of 100 messages and aggregate them. But its bread and butter really is single message things. And the reason why I’ve focused on that is because at the time, the problem that I had was just single message stuff. And there wasn’t really an awful lot of attention on that in the product space. We already had Spark at that point, which was already solving, you know, the problem pretty well, from what I could tell. I hadn’t used it, but it seemed like, okay, windowed aggregations, that’s a solved problem, we have a tool for that. But what’s the nice thing for masking, filtering, transforms, enrichments, hydration, that kind of thing? So I think if I was going to compare it to these products, I would say it’s probably more similar to Apache Camel. And obviously Kafka Connect as well, to an extent. And then the main difference is that it’s kind of declarative from the onset. People say cloud native nowadays, but basically, it can be deployed in Kubernetes without much hassle, that kind of thing. But then Camel’s got Camel K now, so I mean, those services are becoming nicer to deploy. But not like the kind of things that you can do with the Benthos config. With the way that the config is structured, you can do crazy things: you can have multiple inputs fed into a single pipeline with their own processors, and then have joined processors; you can have multiplexed outputs switched on the contents of messages; you can have fan out, all these different brokering patterns, round robin; you can have dead letter queues for processing errors and for when outputs go offline, that kind of stuff.
So it’s much more centered on plumbing, which is why I kind of put it in the sort of Camel category, even though it is a stream processor, does stream processing. So it tends to get compared a little bit more with things like Flink and stuff. It can do window processing. But that’s not really what it’s for. It doesn’t have state necessarily; it does window processing just by keeping it in memory and only committing offsets when that window is flushed. So it’s not. I haven’t done any, like performance comparisons in that, because it’s kind of experimental at this point. But it can do it. I wouldn’t sell that feature at this point.
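The windowing approach mentioned here (keep the window in memory, and only commit offsets once the window is flushed) can be sketched as follows. This is a hypothetical illustration, not how Benthos implements it: `Window`, `NewWindow`, and `Add` are invented names, the window is count-based, and the aggregate is a simple sum.

```go
package main

import "fmt"

// Message is a hypothetical payload with its source offset.
type Message struct {
	Offset int
	Value  int
}

// Window buffers messages in memory. Committed is the next offset to read
// after a restart; it only moves when a full window has been flushed.
type Window struct {
	size      int
	buf       []Message
	Committed int
}

// NewWindow returns a count-based window of the given size.
func NewWindow(size int) *Window { return &Window{size: size} }

// Add buffers a message. Once the window is full it passes the aggregate
// (here a sum) to flush, and only after flush succeeds does it commit past
// the last buffered offset and clear the buffer.
func (w *Window) Add(m Message, flush func(sum int) error) error {
	w.buf = append(w.buf, m)
	if len(w.buf) < w.size {
		return nil
	}
	sum := 0
	for _, b := range w.buf {
		sum += b.Value
	}
	if err := flush(sum); err != nil {
		return err // nothing is committed, so a restart re-reads the window
	}
	w.Committed = w.buf[len(w.buf)-1].Offset + 1
	w.buf = w.buf[:0]
	return nil
}

func main() {
	w := NewWindow(3)
	for i := 0; i < 5; i++ {
		err := w.Add(Message{Offset: i, Value: i + 1}, func(sum int) error {
			fmt.Println("flushed window, sum:", sum)
			return nil
		})
		if err != nil {
			fmt.Println("flush failed:", err)
		}
	}
	// Messages still sitting in the buffer remain uncommitted.
	fmt.Println("committed offset:", w.Committed, "buffered:", len(w.buf))
}
```

If the process dies with a partly filled window, the buffered messages are simply re-read from the last committed offset, which is why no extra persisted state is needed, at the cost of holding the window in memory.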

Kostas Pardalis 29:16
Yeah. All right. And then why did you decide to implement windowing on the platform?

Ashley Jeffs 29:23
Same reason I did most of the stuff, I just thought it’d be fun. There’s a lot of stuff in Benthos that … it’s called a stream processor, and people will look at it and what I reveal on the front page is a stream processor, reading from a streaming system, does some stuff, writes it somewhere. But there’s a lot of stuff in there that does not fit the stream processing category. You can use it as an HTTP gateway if you want to; it supports request response. I had to put that in because of NATS and also ZeroMQ and stuff like that. So it’s always had the ability to do responses to inputs. So you can just hook it up as an API gateway. It has an API for dynamically mutating streams and having multiple streams, so you can use Benthos to drive itself. There’s like loads of stuff in there that doesn’t really fit the category. So I thought, well, I might as well put windowing in there as well.

Eric Dodds 30:11
It’s really fun to just hear in the world of technology and data technology, especially when you think about a sort of, like, San Francisco based companies that are, you know, trying to become really big, there’s a lot of talk about product strategy, and all this sort of stuff. And it’s so wonderful to hear like, I did that, because it’d be fun and it brings me great joy.

Ashley Jeffs 30:37
It is a survival mechanism to an extent because you’re doing a lot of this stuff on your own steam, I’m maintaining this project, just on my own will. So in order to do that, you have to have fun. There is no way of maintaining an open source project, especially in the early years. It’s just not possible if you don’t enjoy doing it to an extent, or at least I wouldn’t want anybody to suffer that experience if they didn’t enjoy it. Because there’s no guarantee of anything, especially with open source. But also any business. Running a business is the same thing. There’s no guarantee that it’s going to end up anywhere where anybody is going to use it. It could fizzle out; it could disappear. You could just get burnt out and not want to do it anymore. So if you don’t enjoy it, then what’s the point? Like there’s no point in it; you’re just punishing yourself.

Eric Dodds 31:27
Sure. Well, one question there. I think looking in from the outside, sometimes it can be hard to tell what the actual experience of building and maintaining an open source project like Benthos is like. You’re working on and consulting around Benthos full time now, so could you tell us about some of the highs and lows that you’ve been through over the past five years as you’ve maintained the project? By the way, I think congratulations are also in order, because that’s a long time for a project that is still being used at a large company. So congratulations, because that’s a huge accomplishment.

Ashley Jeffs 32:17
Thank you, I appreciate that. So the highs are hearing that it helps somebody. When somebody gets excited about the fact that it solved this issue for them, I get a deep satisfaction out of that. And you don’t get it an awful lot with open source, because at the end of the day, most people are going to silently download it, use it and you’ll never hear from them again. Especially if they’re happy. The happier they are, the less you’ll hear back from them. And you know, I’m not, I’m not judging anybody for that. I do the exact same thing. I can’t complain, because I use loads of open source projects. And I’m not, I’m not emailing the maintainer that I really enjoyed it.

Eric Dodds 32:56
What an unvirtuous cycle.

Ashley Jeffs 33:02
Those are the highs, when somebody actually bothers to say, hey, this really helped, and we can now focus on this thing that we want to be doing, you got rid of all these issues for us, thank you for making this thing. Or if somebody asks for a feature, and I get it out to them quick and they’re so thankful: oh my god, it’s amazing, thank you so much. Especially if it was low effort. If it took me like five minutes, and they’re like, oh my god, it’s amazing, you’re incredible, I get a lot of self-satisfaction out of that.

The lows are obviously bugs, like if somebody has had a bug and they’ve had some sort of suffering, the behavior hasn’t been quite what they expected or something’s broken or whatever. So I have a thing: I can’t just leave a bug. I’ll tag it, I’ll label it on GitHub as a bug, and it gets closed that day. I can’t deal with bugs being known and not dealt with. And that’s mostly just, I just can’t handle it, I won’t be able to sleep. Which is, it’s great because it means that I deal with them. I don’t have a backlog of bugs that are constantly getting worse or interacting with each other. But obviously that has a toll. Sometimes I just want to enjoy my evening, and a bug arrives. And now that’s my evening. Like there’s nothing else I can do about that.

To be honest, when you deal with bugs really quickly, it does have an effect. I think there’s obviously lots of blogs out there about dealing with bugs as a team and stuff and how you should prioritize them and all that stuff. And I think that obviously I wouldn’t tell everybody to deal with bugs as soon as they’re known, because that’s just not practical, but it definitely has had a positive impact on the project.

The other thing is, whenever anybody has a question, if there’s a question that isn’t already answered in the documentation somewhere, I consider that a bug. And I will, you know, try and make an effort to fix that either with a guide or flesh out the component docs or something, making some example or whatever. And that has been positive, because obviously, as a solo maintainer, you only have so much time so you can’t be answering questions constantly. So it’s a defensive move in a way to always treat big questions as a bug and just deal with them quickly, but those are the lows because I have to deal with it. And it’s me, it’s a personal issue with me, I could get a therapist and I could deal with that; I’ve chosen not to at this point, because it’s not, it’s not that big a problem. It’s not as if it’s every evening, again, like a bug a month or something.

Eric Dodds 35:37
I guess if you were constantly missing dinner, it would become an issue and then maybe, maybe you would call the therapist.

Ashley Jeffs 35:44
My wife would not have that. She would not have me missing dinner. What I would do is I would go and eat begrudgingly. And then I would come back. Doesn’t get in the way of family functions.

Eric Dodds 35:57
That’s great.

Kostas Pardalis 35:58
That’s great. So okay, let’s go back to the technical questions again, and then we can come back to open source because we have like, quite a few questions to ask there. So let’s discuss a little bit about the architecture of Benthos, how you architected Benthos. And what are the main components and give us a little bit of like, insights of the choices that you’ve made in the trade-offs there and why.

Ashley Jeffs 36:23
Cool, okay, so the main premise of Benthos’s architecture is that it’s, I kind of called it a transactional model. Transaction means a lot of different things to different people now, unfortunately, because I used it as a very general term at the time. Basically, there are lots of inputs in Benthos, and they all have different paradigms for how to deal with acknowledgments and things, with Kafka being the one that’s most different to all the others, in that it’s just a numerical commit. But basically, every input within Benthos has a mechanism for acknowledgments, unless it’s lossy, like TCP or something. Whenever they create a message by consuming something, it gets wrapped in a mechanism for propagating an acknowledgement from anywhere else in the service back to that input, where it knows how to deal with it. And then it pushes it down a pipeline, which is Go channels. I could go on about Go for hours. Basically Golang channels are used heavily as a way of essentially plumbing different layers of the service. Because it’s dynamic, there could be any number of processing threads for that sweet vertical scaling, and there could be any number of different inputs feeding into one or more outputs. So what happens is, the message gets wrapped in a transaction and gets sent down a channel, which is also the mechanism for back pressure: if there’s nothing ready to deal with that message, it can’t go anywhere. And then essentially, that makes its way downstream. So it goes through a processing layer. The processors receive transactions of messages; they actually receive a message batch, but usually if you’re reading a non-batched source, then it’s a batch of size one. And all the processors can do whatever they want.
And if they filter it intentionally, so it gets removed, they call the acknowledgment, and then the input will do things like send that acknowledgement directly back: if it’s Google Pub/Sub, then it will ack that; if it’s Kafka, it’ll mark the offset as ready to commit. The important thing with Kafka... I’ll come back to that, because there’s a whole topic around how the Kafka input works. But basically, the message eventually makes its way to the output layer. The output layer could be brokered: you could have multiple outputs, and they could be multiplexed by logic based on the message. So all the different brokering types in Benthos are kind of composable, so they’re generic components themselves, and you can compose brokers on brokers on brokers if you want to. But they are responsible for essentially enacting the behavior that a user would expect by default. So if it’s a switch multiplexer, you’ve got five outputs, and a message gets routed to three, the message is not acknowledged until those three outputs have confirmed receipt. Obviously, some outputs are better at that than others. And you can tune them to an extent, so with Kafka, you can choose whether or not it waits for all the replicas to be written to. But basically, once you have some way of knowing that the message is successfully written somewhere, then it gets acknowledged, and then it’s up to the input to do whatever.
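
The transactional model described here can be sketched in a few lines of Go. This is only an illustration under assumptions, not Benthos’s actual code: `Transaction`, `Output`, `FanOut`, and `errOutputDown` are hypothetical names. The unbuffered channel stands in for the back pressure mechanism, and the fan-out acknowledges the input exactly once, only after every selected output has reported back.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errOutputDown is a hypothetical error used to simulate a failed output.
var errOutputDown = errors.New("output down")

// Transaction pairs a payload with a callback that carries the final
// result back to the input that produced the message.
type Transaction struct {
	Payload string
	Ack     func(err error)
}

// Output is a hypothetical sink. It returns nil once delivery is confirmed.
type Output func(payload string) error

// FanOut writes the payload to every output concurrently and acknowledges
// the transaction once, after all outputs have returned. The first error
// found, if any, is passed back so the input can decide what to do.
func FanOut(tx Transaction, outputs []Output) {
	errs := make([]error, len(outputs))
	var wg sync.WaitGroup
	for i, out := range outputs {
		wg.Add(1)
		go func(i int, out Output) {
			defer wg.Done()
			errs[i] = out(tx.Payload)
		}(i, out)
	}
	wg.Wait()
	for _, err := range errs {
		if err != nil {
			tx.Ack(err)
			return
		}
	}
	tx.Ack(nil)
}

func main() {
	// Unbuffered: the input blocks on send until the next layer is ready.
	txChan := make(chan Transaction)
	results := make(chan error, 2)

	go func() {
		defer close(txChan)
		for _, p := range []string{"first", "second"} {
			txChan <- Transaction{Payload: p, Ack: func(err error) { results <- err }}
		}
	}()

	ok := func(string) error { return nil }
	flaky := func(p string) error {
		if p == "second" {
			return errOutputDown
		}
		return nil
	}
	for tx := range txChan {
		FanOut(tx, []Output{ok, ok, flaky})
	}
	close(results)
	for err := range results {
		fmt.Println("ack result:", err)
	}
}
```

In this sketch the input never sees a successful acknowledgement for a message that one of its outputs failed to accept, which mirrors the "routed to three, acknowledged after three" behavior described above.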

So for most inputs, especially the easy queue systems like NATS and GCP Pub/Sub and stuff, where ordering isn’t as important, people don’t really consider that when they’re processing messages from those. You can just keep pushing messages down the pipeline, and if there’s capacity, then they’ll get processed. If there’s back pressure on the output, naturally it makes its way up to the input pretty quickly, and then when it’s freed, the components gracefully resume. With Kafka, by default, topic partitions are processed in parallel. So if you’ve got 10 processing threads, and you’ve got 10 topic partitions, you’ve potentially got 10 threads saturated, not necessarily if they’re not balanced well, but in theory, you’ve got 10. But messages of a partition are processed in order. So your options are: you can batch them and process multiple messages of a topic that way, or you can increase what I call a checkpoint limit, which is basically how many messages you are willing to process out of order. And what I do there is I keep track. So if you say, we want to be able to process 100 messages async, in whatever order, we don’t really care about that, we just want to process them fast, I limit the number of messages and I track which offsets we’ve actually acknowledged. And I will only commit up to the point where all the messages from that commit number down have already been acknowledged. So there’s potential there for duplicates. Say you process 100 messages. The first one that went through the pipeline for whatever reason hasn’t been acknowledged, it’s blocked somewhere, and all the others have. Well, guess what, none of them get committed until that last outstanding one has gone through. And that ensures that when you restart the service, you don’t get data loss. But then the trade-off there is that you can potentially get duplicates next time you start it.
So the difficulty with a service like this is finding the common mechanism that’s going to satisfy all these different input types with all their different ways of handling acknowledgments, and what they’re typically used for as well. Because obviously, some people might want to do ordered processing with, you know, a queue system like NATS, but most people don’t really care. So you can enable it, but by default, you’re just going for throughput and vertical scaling. Whereas with Kafka, typically, people care about the ordering, and they want to do batched processing of some type. So you can manage it that way. But essentially, I’ve had to refactor the components multiple times to make sure that I could do all this stuff, and basically they all kind of fit their own paradigm now. And, yeah, I think I probably missed a million things there.
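
The checkpoint limit idea for Kafka can be sketched as a small tracker. This is a minimal sketch under assumptions, not the real Benthos implementation: `Checkpointer`, `Track`, and `Ack` are hypothetical names, and the commit value follows the Kafka convention of "next offset to read after a restart".

```go
package main

import "fmt"

// Checkpointer tracks in-flight offsets for one partition and reports the
// offset that is safe to commit. It is a hypothetical illustration only.
type Checkpointer struct {
	limit   int            // max messages allowed in flight at once
	pending []int64        // offsets handed out, oldest first
	done    map[int64]bool // offsets acknowledged but not yet committable
	next    int64          // next offset to read after a restart
}

// NewCheckpointer starts tracking from the given offset.
func NewCheckpointer(limit int, start int64) *Checkpointer {
	return &Checkpointer{limit: limit, done: make(map[int64]bool), next: start}
}

// Track registers an offset as in flight. It returns false when the limit
// is reached, which is the signal to stop reading (back pressure).
func (c *Checkpointer) Track(offset int64) bool {
	if len(c.pending) >= c.limit {
		return false
	}
	c.pending = append(c.pending, offset)
	return true
}

// Ack marks an offset as finished and returns the commit point, which only
// advances past a run of acknowledged offsets at the head of the queue.
func (c *Checkpointer) Ack(offset int64) int64 {
	c.done[offset] = true
	for len(c.pending) > 0 && c.done[c.pending[0]] {
		c.next = c.pending[0] + 1
		delete(c.done, c.pending[0])
		c.pending = c.pending[1:]
	}
	return c.next
}

func main() {
	c := NewCheckpointer(5, 0)
	for off := int64(0); off < 5; off++ {
		c.Track(off)
	}
	// Offsets 1 to 4 finish first; the commit point stays at 0.
	for _, off := range []int64{1, 2, 3, 4} {
		fmt.Printf("ack %d -> commit %d\n", off, c.Ack(off))
	}
	// Once offset 0 finishes, everything becomes committable at once.
	fmt.Printf("ack 0 -> commit %d\n", c.Ack(0))
}
```

If the process stopped before offset 0 was acknowledged, a restart would resume from offset 0 and redeliver offsets 1 to 4 as well. That is the duplicate trade-off described above: nothing is lost, but already-finished messages can be seen again.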

Kostas Pardalis 42:05
It’s fine. It’s fine. But I have a question. How important is ordering, based on your experience with stream processing?

Ashley Jeffs 42:15
That’s a good question. So for me personally, it’s never been an issue, because I’ve never worked on a system that actually cared. In event sourcing land, it’s super important, I would imagine. I’ve had people come to me and have a discussion about how we can guarantee ordering. What about in the event of failures and stuff? If we’re retrying messages, how do we guarantee we’re getting the right ordering, and stuff like that? And I mean, it’s a complex problem to make sure in all cases, every single edge case, you’ve definitely got the correct ordering. But I think it is possible, just like a perfectly secure system is possible. But yeah, I think it is doable. Mostly I would attribute that to event sourcing, where you’re processing a stream of actions, and you need to make sure that they’re done in the right order, because it has, obviously, an effect on the outcome. But to be honest, I would have traditionally described Benthos as a system where it probably doesn’t matter, because you’re doing single message transforms anyway, enrichments and stuff. But then obviously, if you’re using it to bridge between services, and something downstream does care about ordering, then it also has to respect ordering. So I think some services have gone down the path of not really caring about ordering too much. And maybe there’s a way of dealing with it. I am tempted in the next major version to reconsider whether or not I make it default, because obviously it does make scaling easier for people if, just by default, it doesn’t really care, and it’s letting you use however many processor threads you’ve got. But for now, it’s strict on ordering until you give it the explicit instruction to allow it.

Kostas Pardalis 43:59
And based on your experience, again, what’s the main trade-off that you have to make in order to have ordering? Is it just performance? Is there something else? You mentioned something about duplicates, so there are differences there with delivery semantics also. What are the main trade-offs that an engineer needs to have in mind when they opt for strict ordering?

Ashley Jeffs 44:23
If you don’t care about delivery guarantees, then the main problem is just throughput. How easy is it to do vertical scaling if you’re forcing ordered processing, and you’ve got a limited number of topic partitions? Because that’s tied to your Kafka deployment. The number of partitions is something that somebody else has probably made the decision on; you might not even have control over it. So you’re on the processing side: oh, I’ve got 24 CPU cores, lucky me. If there’s only three partitions, and you’re doing ordered processing, then you’re stuck. You’ve got three CPUs in use, and unless you can vertically scale the individual message processor, you’re kind of out of luck on that. But if you care about delivery guarantees, the forced ordering, in terms of Benthos, to a Benthos user, just means you’ve got to configure one extra field, essentially, to manually determine how much parallelism you are willing to go for. Because messages aren’t persisted by the service, what it’s doing is making sure that it’s never committing an offset that would result in one of the messages that hasn’t been finished yet being lost forever. So the reason why you can potentially get duplicates there is because if you choose to process messages out of order with Kafka, then obviously that means the messages that came after a particular offset could be finished and dealt with, the next service has already got them and they’ve got a new life in the suburbs, whereas some earlier messages are hung up or haven’t been dealt with, for whatever reason. You cannot commit that offset. Because if you commit that offset, or you do anything else with it, then the next time the service is restarted, you’re not going to consume those messages again. So basically with Benthos, you have to be strict, because I’m not maintaining a disk-persisted buffer or anything like that. So those messages don’t exist anywhere else.
I’m using Kafka’s disk persistence for that. So yes, it’s one of those things where my role is to basically document what’s the symptom of doing that? And like, if you want to get better CPU scaling, what is the solution to that thing? Because right now, there’s a guarantee that you might not want or you might not care about.
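
The 24 cores versus three partitions example can be put as back-of-envelope arithmetic. The function below is a deliberately simplified model, not a Benthos API: `effectiveParallelism` and its `inFlightPerPartition` parameter are hypothetical, with 1 meaning strictly ordered processing and larger values standing in for the "one extra field" that allows out-of-order work.

```go
package main

import "fmt"

// effectiveParallelism estimates how many messages can be worked on at
// once. With strict ordering each partition contributes one message at a
// time, so the partition count is the cap. Allowing more messages in
// flight per partition raises the cap until the core count takes over.
func effectiveParallelism(cores, partitions, inFlightPerPartition int) int {
	if inFlightPerPartition < 1 {
		inFlightPerPartition = 1
	}
	p := partitions * inFlightPerPartition
	if p > cores {
		return cores
	}
	return p
}

func main() {
	// Strict ordering: only 3 of the 24 cores can be kept busy.
	fmt.Println(effectiveParallelism(24, 3, 1))
	// Up to 100 messages per partition out of order: all 24 cores are usable.
	fmt.Println(effectiveParallelism(24, 3, 100))
}
```

The model ignores I/O waits and uneven partition load, so it only shows the direction of the trade-off: more parallelism is bought by accepting out-of-order processing and possible duplicates after a restart.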

Kostas Pardalis 46:42
So do you have any plans, or are you considering adding some kind of state that could help with these kinds of situations? Or have you absolutely decided that Benthos is going to be stateless?

Ashley Jeffs 47:01
Before I went to version one, for like three years or something, I did have a disk buffer as an option. The reason why that’s particularly useful is if you’ve got a chain of lots of services that are synchronous. So imagine you’ve got HTTP to HTTP to HTTP to ZeroMQ or something. Because of the acknowledgement system, if there’s no disk buffer in any of those individual components, it means the acknowledgement has to propagate all the way up. So it’s the same problem that people get with massive microservice architectures, where the service that begins the request chain has to wait forever, and any disconnects cause a duplicate. So I did have a memory buffer, and the memory buffer is still in, and I had a disk buffer as well. I got rid of it, because I thought, well, I’m not sure anybody needs it, I just want to see if I can get away with not having it. And nobody asked for it back. So it’s actually still there in the codebase. Because I wasn’t sure if somebody was using it like a library or something. So I’ve left it there just in case it’s being used in somebody else’s project. But it’s not, it’s not in the codebase. And to be honest, I think I like the idea of having to solve … essentially, in order to not have state, in order to not have this operational complexity of something that a person running the service has to know about: we’re using the disk for this thing, so don’t delete that. And if the disk is corrupt, you’re gonna have to follow this step, this step, this step. And if the server crashes, you’re going to have to do a backfill, and we don’t know for how long. In order to avoid that, the burden is on me to make a stateless version of that same feature functional. And it normally ends up just being, I’ve got to be more considerate with how I do things.
So in the basic Stream Processing world where it’s just about acknowledgments, the burden is on me to solve, having a transactional acknowledgement system, and also being able to vertically scale and also being able to do things like batch sends, and all this other stuff. Because when you’ve got a disk buffer, that stuff is easy, you write it to the disk buffer. And when you’re done with it, you delete it from the buffer. It’s more difficult in my world, because in order to do things like get nice backpressure and shutting down gracefully, all that stuff I have to be super strict about when we are going to allow things to close. And what happens if messages haven’t been acknowledged, when we’re shutting down? How are we going to read n messages from the queue system without necessarily acknowledging them immediately? What are the difficulties there for each of the individual queue systems? But I kind of feel like that’s my role as somebody building a generic service, that’s my problem. Yeah, because I’ve accepted that problem. Like I’ve accepted the role of giving you this generic tool. And therefore if I didn’t try my hardest to make this thing stateless and easy to deploy, I haven’t really done my bit, I have not fulfilled my role. Like if I just give you a service that’s as complicated as something that you would have made easily is, and the config system is just as complicated as your code would have been. Just use your code. Like why would you involve me in the equation at all? I’m not doing anything for you. I’m not fulfilling any purpose here. And so why do I exist? I ask myself that every day.
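
One of the questions raised here is what to do about unacknowledged messages at shutdown when there is no disk buffer. A minimal Go sketch of one possible approach follows, assuming that anything still unacknowledged is simply left uncommitted at the source and redelivered later. `InFlight`, `Drain`, and `drainTimeout` are hypothetical names for this example and do not describe how Benthos actually implements shutdown.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// drainTimeout is an arbitrary illustrative shutdown deadline.
var drainTimeout = 200 * time.Millisecond

// InFlight counts messages that were read from a source but have not been
// acknowledged yet.
type InFlight struct {
	wg sync.WaitGroup
}

// Start records that a message has been handed to the pipeline.
func (f *InFlight) Start() { f.wg.Add(1) }

// Ack records that the message was delivered or intentionally dropped.
func (f *InFlight) Ack() { f.wg.Done() }

// Drain blocks until every started message has been acknowledged or the
// timeout elapses. It returns true only when nothing is left in flight.
func (f *InFlight) Drain(timeout time.Duration) bool {
	done := make(chan struct{})
	go func() {
		f.wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(timeout):
		// Assumption: unacknowledged messages were never committed at the
		// source, so they are redelivered on the next start, not lost.
		return false
	}
}

func main() {
	var f InFlight
	for i := 0; i < 3; i++ {
		f.Start()
		go func() {
			time.Sleep(10 * time.Millisecond) // simulated processing
			f.Ack()
		}()
	}
	fmt.Println("clean shutdown:", f.Drain(drainTimeout))
}
```

On a timeout the waiting goroutine in this sketch is left behind, which is acceptable for a process that is about to exit but would need more care in a long-running component.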

Eric Dodds 50:29
Well, that’s a whole other podcast episode. But usually, usually, when you encounter something boring, it’s because there’s a lack of opinion. And so this is an ironic situation where the characteristics of being boring are actually because of like, extremely strong opinions that you have to have about the architecture, which is, which is really interesting.

Ashley Jeffs 50:54
It’s like it’s more. So it’s super strict on the most difficult mode of operation. Because I’ve got a lot of people who use it for logging, they just use it for moving logs around from the services where they don’t care about data loss, if I told them, we’re dropping 50% of your messages, they probably don’t even care. They’re just like it’s just logging, who cares? And I don’t even think they know that it’s got these strong delivery guarantees, because they don’t need to know because it’s one of those things where I can be. I’ve basically made a really strict decision to be super opinionated about something. But the important thing is that the opinion is, it’s not really burdensome for anybody. It’s not, it’s not really a problem. And I think that’s kind of where the trick is in these generic services is to have the opinion that is least hands on for people. Because if it was lossy, right, and somebody wants to deploy this, and it does have a mode of being not lossy. But you’ve got to read a manual to do that, it’s a nightmare. The burden is on you as a user to make sure that you’ve plugged all these gaps that the service naturally has to make sure that data is actually going to be delivered somewhere and that you’re not just going to lose it on an outage that you hadn’t foreseen. Whereas on the end that I’m on, whereas everything’s super strict, and it’s locked down, but you do get the you know, vertical scaling and all that stuff. People just don’t realize, like people are accidentally building these really resilient pipelines, unbeknownst to them. Maybe they’re angry about it. I don’t know.

Kostas Pardalis 52:32
I have a ton more technical questions. But we also have to respect the time here, and we really need to discuss open source a little bit. So I have a question that I want to ask you about. You described how you decided to make this project open source, and it’s been like five years now that the project has been out there. Can you describe a little bit how the traction happened with the project, or how you perceived that the project started getting traction? Was it something that you tried to do deliberately, or something that just happened because people were, I don’t know, organically finding out about it? How did you end up having such a popular project today?

Ashley Jeffs 53:21
So I was really lucky. Primarily, I had successful open source projects before this, in throwing a library over the wall, and it got some stars on GitHub. And people used it for stuff, very hands off projects. And my method was just to write something that I want to use, and I think is interesting. Post it on the Golang subreddit, shout out to the Golang subreddit. And then it might get picked up in some newsletters and that sort of stuff. And then I would leave it because once it has enough eyes, it only needs a few, it will just pass by word of mouth. That was my experience. And I wasn’t going to change the experience because I hate sharing my stuff. It seems ironic because of all the content I put out, but I hate sharing my own stuff. Because I feel really guilty about it. I feel like I’m spamming everybody and going out of my way to force myself onto their screens. So this podcast is ironic, but that was my experience up to this point.

Kostas Pardalis 54:21
We also have a marketeer here whose job is to spam people out there, right.

Eric Dodds 54:28
Very elegant way of describing my job.

Ashley Jeffs 54:32
Your job would give me so much anxiety. But yeah, so I had a project that I liked, like I liked Benthos. After two years, I wanted to use it. But I wasn’t convinced other people felt the same. So I was kind of reluctant to really do much with it. I think I posted it on some forums and things. But I was really lucky, because being at Meltwater, they were such a welcoming engineering community that I was kind of forced out of my shell a little bit. I was pushed and encouraged a lot, like, this is cool, you should share it with more people, people should see this thing. So it encouraged me to come out of my shell a little bit and start evangelizing it. That was mostly internal. Then I struggled because I hate writing blog posts, and especially marketing ones. So I just didn’t have the energy to go any further than that. It had organic use in the company. The great thing about engineers, with word of mouth marketing, is that engineers churn at such a high rate that you can go to one organization and kind of evangelize this product, and within like two years, half their engineering team has spread to other places. And it’s a virus. They’re going to introduce the tool to their engineering friends. So word of mouth is, I think, the main driver of Benthos. But there was one fateful day where I made a video kind of outlining the rough architecture, and specifically what I did wrong, put that on YouTube and put that on the Golang subreddit. And it got picked up by a couple of newsletters or something, and I got a bit of attention that way. And then I tried posting on Hacker News a bunch of times: no success, no interest whatsoever. And then one day, I woke up in the morning, and it was on the front page, and some random stranger had stolen my karma. It was right up there and got a load of attention. And I think that was the first time where the attention was enough that after that point, I had a constant feed of new people coming in.
Because obviously, the word of mouth is a constant steady growth. But you need something to boost you to the point where enough people are seeing it that you actually have enough attention because I did have people using it up until that point, but it didn’t, it didn’t feel like it was enough to justify investing a lot of energy into this thing. It was a fun hobby project when I felt like it. But I wasn’t gonna like double down on this is definitely something people want. Until I saw that. I think that was kind of like a turning point where I put more effort into growing it and kind of trying to build out the community and things but I would still say that the majority of the growth of the project is just word of mouth. I’m not paying for sponsorships. I’m not doing particularly well on blog posts still. So it’s just it’s just stuff like this, I guess. And then people telling other people about it and growing the community. I think a lot of people see the graphics and then they want to share with their friends and I want to get the stickers. And so that helps spread it a little bit.

Eric Dodds 57:41
We need some background on this. So the blobfish, right? That’s what it’s called, right?

Ashley Jeffs 57:49
It’s a blobfish.

Eric Dodds 57:50
Okay, give us the backstory. I love it. I mean, I kept smiling as I was going through the site and the docs because I would meet a new version of the blobfish every time and it’s so great.

Ashley Jeffs 58:03
So all the libraries I used to make, the things that I did before Benthos and probably the stuff I’ll do after as well, are always accompanied with some dumb logo, because you’ve got to have a logo for your project, right? Otherwise nobody’s gonna take it seriously. And I used to be obsessed with the idea of just having the most unpalatable logo for something, because it will be included when people vendor their dependencies. So the idea of companies that are serious, and actually have a purpose on this planet, they’re doing something important, having these dumb graphics somewhere on their servers, I just loved it. I loved the idea. One of the logos I’ve got is a turkey, just looking glam. It’s just a library called Gabs, and I just loved the idea of professional people in a professional environment relying on this thing and seeing that graphic once a week or something.

Eric Dodds 59:05
You know, you’re probably way closer to being a great marketer than you even realize.

Ashley Jeffs 59:12
I have to say that the more fun I have doing the documentation and stuff the better it does generally, because I think it comes off like people love documentation that’s just not very serious. It’s laden with dumb humor and silly quips. None of my examples are serious in the slightest. They’re all the goofiest dumbest examples I could possibly muster. But the graphics of Benthos being a blobfish was just me finding an ugly animal or traditionally ugly animal. My logo is obviously very easy on the eyes.

Eric Dodds 59:45
Is blobfish a real fish?

Ashley Jeffs 59:49
This is a controversial topic. So it is a real animal, and it’s got a proper name, which I don’t know. Lots of people are going to be upset about that; I don’t know the right name of this particular fish. And it’s a deep sea fish. So when you’re looking at the picture of a blobfish, it actually looks like that because it’s been depressurized, because it’s in the normal atmosphere. So it’s not in a particularly happy way. So really my graphic is a dead fish. But I’ve shied away from calling it a dead blobfish, and I just call it a blob. Nowadays, it’s just a blob with a face. But that’s the brilliant thing about that particular logo: because it’s a blob, you can put it into all kinds of different form factors, different designs, different shapes. It’s perfect for marketing materials and swag.

Eric Dodds 1:00:41
Now who designs the different variations? Because there’s a lot of variations of the blobfish. Who’s the mastermind?

Ashley Jeffs 1:00:51
I do the bulk of them. I’m the brain behind all of the different variants and their particular equipment. It’s normally topical. It’s normally you know, for a particular example. And then my wife has graciously helped me out with a couple of them. She is a graphic designer. And she does that begrudgingly. Because she doesn’t like my blob. She thinks it’s a mockery of her career.

Eric Dodds 1:01:22
It sounds like she’s very supportive.

Ashley Jeffs 1:01:26
She’s supportive. But, she’s not happy about it.

Eric Dodds 1:01:30
Well, this has been so great. We’re at time here, but this has been a wonderful conversation. Really quickly, if someone wants to check out the project, where should they go?

Ashley Jeffs 1:01:42
Benthos.dev. And if you want to hang out, there is a Discord. There’s a link at the top, “community”; click that and it’ll take you to a bunch of links. You can either join the Gophers Slack, where I’ve got a channel, or you can join the Discord server, which is all ours. That’s where you can find blob bot, the famous Discord bot, and me as well, and the fabulous Benthos community is there as well.

Eric Dodds 1:02:05
Great. And if someone were really motivated to get blobfish stickers, how do they do that?

Ashley Jeffs 1:02:11
There are ways of getting blobfish stickers. If you do a blog post and let me know, it doesn’t have to be related to Benthos. Just do a shout out at the bottom of your blog post. Hey, by the way, Benthos.dev, and I’ll give you some stickers. I’m good with that. But you have to give me your address. I don’t know if people trust me with their address.

Eric Dodds 1:02:29
Well, you’re open source and your logo’s a blobfish. So that seems innocuous.

Ashley Jeffs 1:02:35
I’m on the internet. Yeah.

Eric Dodds 1:02:38
I’d much more readily give you my address than maybe, like, a marketer or someone.

Ashley Jeffs 1:02:44
From my perspective, I think that’s the wrong call. But we’ll let people make their own minds up.

Eric Dodds 1:02:50
Awesome. Well, Ashley, this has been a really wonderful show. Amazing, amazing project. And best of luck as you continue to build out.

Ashley Jeffs 1:02:58
Thank you very much. Thank you for having me. It’s been fun.

Eric Dodds 1:03:01
There are so many things from that episode that stick out. As I rolled it around in my mind, I think the thing that stuck out, which we didn’t talk about explicitly, is that the world of data internationally is so big. I hadn’t heard of Benthos before we started prepping for the episode, which isn’t a huge surprise because I’m not necessarily the target audience. But there are just so many teams working on so many different data products at so many different companies. And you have a tool like Benthos that’s being used, you know, by large organizations solving pretty critical problems. And it was just a good reminder for me of the breadth of the entire market and how important data has become at every type of company. So it kind of just made me step back and appreciate that. Because a lot of times you see the usual suspects in terms of names around data processing. Kafka is talked about a ton, and all of these different tools, and then to see a project like Benthos having an impact, it’s like, man, it is really a big world. And there are so many different cool products out there. And I love learning about the specific problems that Benthos solves.

Kostas Pardalis 1:04:19
Absolutely. And it’s especially interesting with Ashley today because if you remember at some point he mentioned that when he started working on this project, his title wasn’t data engineer because data engineer was not a thing back then. Right? Well today like everyone is talking about data engineers. And so yeah, it’s very interesting. There are many tools and there are many tools that actually exist because someone had the need inside the company to automate their job and get more time to work on more interesting things, exactly what Ashley was talking about, right? And that’s like I think part of software engineering in general. I don’t know, I really enjoyed the conversation today. I think Ashley is like an amazing person. He’s a much better marketeer than he thinks by the way.

Eric Dodds 1:05:15
Totally agree.

Kostas Pardalis 1:05:19
I mean, the work he has done with the logo and all the content that he has created and everything, it’s amazing. Yeah. So I would encourage everyone to go and check the website, Benthos.dev. There’s a lot of cool stuff, technical stuff, but overall it’s also a great experience. So even if you don’t need a tool like Benthos, go and check it out. It’s amazing. And I hope that we are going to have more time to spend with him, because he’s a treasure trove of knowledge around these kinds of very complex systems. And we have many more technical discussions to have with him. So I’m really looking forward to chatting with him again in the future.

Eric Dodds 1:05:59
Absolutely. That’s the show for today. Give us feedback at Eric@datastackshow.com. We’d love to get your feedback and any questions that you have about any of the episodes, and we’ll catch you on the next one.

We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at Eric@datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
