Episode 72:

Building Data Ops Into the Data Lifecycle with Douwe Maan of Meltano

January 26, 2022

This week on The Data Stack Show, Eric and Kostas chat with Douwe Maan, the CEO of Meltano. During the episode, Douwe discusses data tooling, open-sourcing, and data houses.

Notes:

Highlights from this week’s conversation include:

  • Douwe’s career journey (3:04)
  • The missing piece in GitLab’s data tooling (7:35)
  • The open-source offering in the data space (12:38)
  • Singer’s connection with Meltano (22:31)
  • How Meltano manages connectors on a diverse codebase (35:21)
  • The data house side of Meltano (39:47)
  • Data house operating versus Airflow (44:06)
  • Meltano’s vision present today (47:02)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Automated transcription – may contain errors

Eric Dodds 00:06
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. Welcome back to The Data Stack Show. Today we’re going to talk with Douwe, who is the CEO of Meltano. And I almost caught myself saying CEO and founder, but Meltano has such an interesting story. It was a project started inside of GitLab, which is a really large company that builds a DevOps platform. And Douwe worked on the project inside of GitLab. And I’m so interested to hear from him about how Meltano came to be inside of GitLab. We’ve talked with several guests on the show who’ve been part of technologies that were spun out. So we talked to someone from Netflix recently, we talked with someone who worked on building Hudi, you know, and several other technologies like that. GitLab isn’t quite as big as some of those companies, you know, they recently IPO’d, and so to see this happen, and kind of have it be so fresh, I’m really excited to hear the origin story of Meltano. How about you, Kostas? Having built tools in the ETL space, I’m sure you have a ton of questions.

Kostas Pardalis 01:31
Yeah, I really want to discuss with him the evolution of Meltano. Meltano has gone through a transformation as a platform. I mean, many people probably remember it as an ELT tool, like a competitor to Stitch Data and Fivetran. Today it’s something different. It’s more of a platform, in this new category that they call data ops, which is very exciting for me, because what it tries to do is bring all these best practices from software engineering into data engineering. And yeah, I’d love to see what happened, how the project changed, how it became, you know, a company that is raising money right now. And also discuss open source projects like Singer, because Meltano is very active there. So yeah, we will have a lot of things to talk about for sure.

Eric Dodds 02:23
I have no doubt. Well, let’s jump in and talk with Douwe.

Kostas Pardalis 02:26
Let’s do it.

Eric Dodds 02:28
Douwe, welcome to The Data Stack Show. We can’t wait to talk to you about Meltano.

Douwe Maan 02:32
Thanks for having me, Eric. I’m very excited to be here.

Eric Dodds 02:35
Okay, you have such an interesting pathway that led you to being the CEO of Meltano. Can you just tell us a little bit about your career trajectory? How you got involved with Meltano? And then sort of the story of how you became a CEO, because it was inside of another company before?

Douwe Maan 02:54
Yeah, that’s right. So Meltano was founded inside GitLab. So if we go a little bit further back, I can kind of describe how I ended up there. I personally got into programming and computers at a very early age, at the age of nine. You know, my father always had computers around the house, and not just stuff running Windows, but we had, like, Linux. So I always saw computers as something that could be tinkered with, and that was an outlet for creativity, rather than just something that does a thing and you use it when you need it. So from a very early age, I got into programming, and through open source I was able to teach myself a lot of things that, you know, in another time might have required going to college, to the extent that by the end of high school, I had built a bunch of web applications. And I had founded a company, and through the company I had at the time, which was called Stingo, we built products for bed and breakfast owners to manage their reservations and their calendar and communication. And yeah, I guess at the end of high school, I was initially working for a company to build iOS and Mac apps as lead engineer. And then through that company, with one of my bosses at the time, we ended up co-founding a company to build software for bed and breakfasts. So the thing is that when I was, you know, in high school, early college, there were not a lot of people around me who were kind of building products at this level already. So I really looked for like-minded individuals in the Netherlands and the European kind of tech and programming space. So I ended up going to a lot of tech conferences around Europe. And in 2012, I think that was probably around the year, I ended up at the European Ruby Conference in Athens, where I was by myself, and, you know, over lunch, I walked up to a table and introduced myself to someone sitting there, because I wanted to, you know, have a place to put my tray down.
And I talked about what I was doing and that I was from the Netherlands. And he mentioned to me that his boss, and he pointed to the corner of the room, was from the Netherlands as well. So I walked up to his boss and I explained to him what I was doing and this bed and breakfast company we had, and it turned out I was talking to Sid Sijbrandij, the CEO and founder of GitLab, which at the time was this tiny little Dutch company that had been built around a Ukrainian open source software project called GitLab. You know, one of these version control, code review, kind of GitHub-like tools. And it turns out that Sid’s parents owned a bed and breakfast in the north of the Netherlands. So his parents became customers of the product that I had basically, you know, from an engineering perspective, single-handedly built. I won’t take full credit for the company side of things. And coincidentally, Sid and I kept running into each other at different conferences around Europe in the coming months, up to the point where he asked me to join GitLab, just around the time that it was going through its Y Combinator program and was raising its first funding. And then the timeline kind of worked out, because at the company I was running at the time, you know, I was 18, but my co-founders were 35 and 56 or something. So we were at very different risk tolerance levels in our lives. So we decided to kind of wind that down, and I jumped on the chance to join GitLab. And for the first year or so I was a software engineer. And then I became responsible also for building out the engineering team, hiring more engineers from the open source community, which is always a really great position to be in, when you can bring people in that have already kind of proven themselves and their enthusiasm for the product and their ability to, yeah, come up with solutions that will help them and others.
And then over a number of years, I got into engineering management, up to the point where in 2019, GitLab had grown massively, from 10 people to 1,400. And I was starting to feel that itch and wanted to go back to the earlier startup days, where you have a smaller team and there’s so much to do every day, and you can really feel the impact of the decisions you’re making in a very short term. And in general, that way of being at the forefront of solving some new problem and having super happy users, which, you know, as I’m sure everyone in the room today is familiar with, is really great. So I joined Meltano in 2019, but Meltano had been around since 2018. Meltano was originally founded inside GitLab because the GitLab data team, and GitLab as a whole, realized that the state of data tooling was very different from the types of developer tools we had gotten used to, tools that embrace best practices such as version control and code review, and allow entire teams to collaborate on their product in a way that enables really quick iteration and makes it easy to experiment and make sure that people can just make changes without being worried that they’ll break stuff in production. And as an engineer looking at the state of data tooling, myself, but also other engineers in GitLab, we were kind of surprised to see that a lot of these best practices that we saw as pretty transferable, and a lot of the problems these teams have that have parallels in software development, were not being addressed yet by the tooling of the day. So, sorry, go ahead.

Eric Dodds 07:29
Oh, just to dig in a little bit. That’s super interesting. And so just to say that another way, you were looking at data tooling, so let’s just say, you know, whatever, traditional ETL or streaming or whatever. Were the challenges more that they were primarily sort of UI based, and tucked a lot of the mechanics under the hood? And so you don’t have things like version control, and there’s not really a development lifecycle with data tooling as there is with normal software. Was that sort of the key piece that was missing?

Douwe Maan 08:01
Yeah, yeah, we can talk about that a little bit more. That’s great. So GitLab was relatively late to start setting up its data team. So the initial beginnings of that were really just GitLab engineers looking around and seeing, okay, you know, we’ve got to build a data stack, we’ve got to move data from A to B, and we want to analyze it. And they came into it with certain expectations, like, oh yeah, you know, we’re developers, so sure, this is all kind of like building an application or building these pipelines. And then what they found is exactly what you’re describing: some of the things that they had started taking for granted were missing. Even though, even in the software development world, DevOps was not really a thing 10 years ago, and, you know, I grew up FTPing into a web server and making live changes to PHP files in production. And that still very much feels like the way the data space is today, or at least was a couple of years ago. So the big thing is definitely a lot of these tools being UI based, being kind of proprietary SaaS tools that run in a browser somewhere and don’t give you a lot of the flexibility and customizability and ownership and say over a really core component of your stack that developers expected, in combination with these tools not being open source, which also ties into being sort of limited by what they do today and not having that opportunity to improve them or to tweak them to fit your workflows better. But the fact that they’re UI based, and that they come from a world where, you know, companies have these big end-to-end data tools they log into and make all the changes in the user interface, didn’t jive with these expectations of pipelines as code.
Everything can be code, version controlled. Everyone in the team, no matter their discipline or their kind of comfort around EL, for example, is able to go in and see the configurations and propose changes, and trace how data flows through the system by having a full overview of everything. And, exactly like you’re saying, version control, code review, continuous integration and deployment, having automatic tests run so that things don’t accidentally break, having isolated environments so that you can make changes locally with complete freedom without ever worrying about accidentally breaking the dashboard the CFO is looking at. These are things that we were expecting to find and didn’t. So we saw it as an opportunity not just to build an internal tool for GitLab to use; we saw that there was an opportunity in the market here to build data tooling that really embraces, at a really deep level, the software development best practices of DevOps and open source. And from day one GitLab realized that by building a tool that would help GitLab in this way, it would also be able to help people externally. So from day one, the hope was that this would one day develop into its own business unit, its own business, you know, per se, by building something valuable for us that would transfer to others. And we saw an opportunity to make data ops a reality, similarly to how GitLab had been pivotal in making DevOps and DevSecOps a reality. So in 2018, when the data team was really small, GitLab set up this team to start building this tool called Meltano, Meltano being an abbreviation for Model, Extract, Load, Transform, Analyze, Notebook and Orchestrate. Those are some of the stages of the data lifecycle that we identified. I don’t know who it was that put it together in this particular order, but I think Meltano has a really great sound and kind of feel to it, and it’s cool that it kind of relates back to all of those aspects of the data lifecycle.
But we also saw that GitLab’s data needs were growing at a pace that the internal team building Meltano was just not able to keep up with, so GitLab did end up using some of the more traditional tools in the space (Fivetran and Stitch for EL, and a bunch of different tools we tried out for the BI side of things). But we always believed that the future of data tools, not just for GitLab internally but also for the whole world, would look a lot more like software development tools, with data people becoming more and more comfortable, not with programming per se, but at least with concepts of version control and command line interfaces and managing your configuration in YAML files. And the Meltano team never gave up on that goal or that vision for the future.

Eric Dodds 11:47
I love it. You know, it’s interesting, if you think about some of the more UI-based tools, a lot of those are driven by analytics use cases from other parts of the organization. And so it makes sense, you know, sort of the way they were built, you know, sort of the SaaS model and tucking the mechanics under the hood. And so now, we’ve had a lot of people on the show who are trying to bring software development principles into the data space, because they realize the need there. But I just love thinking about the team at GitLab, who’s been building DevOps stuff, coming into the data space and saying, whoa, like, what’s going on? Like, yeah, you know, where’s all the componentry? So great.

Douwe Maan 12:31
For me, it was really interesting, and I’m jumping around in the timeline a little bit, but I’ll talk more about how I came to join Meltano. When I joined Meltano, I was very new to the data space. I knew that clearly there was a need there for something GitLab was building, but I was just really surprised to find the breadth and depth of the open source offering in the data space. I was positively surprised on some fronts, because there are tools that are just really great, or, you know, set up to be really great with a few more years of innovation. The BI tools, for example, like Metabase and Superset and Redash, there’s a bunch. And dbt is phenomenal as a transformation tool that also kind of introduces a lot of analysts to some of these software development best practices. But at some point I was surprised to see that, especially on the data integration side, of course there exist tools like RudderStack that are focusing a little bit more on what we now call “reverse ETL,” but everyone still seems to be using Fivetran or Stitch. There existed a library of connectors in the Singer standard that had been built around Stitch, but a full stack that can replace Fivetran, for example, that you can just run open source, I was surprised to find in 2019 that that hadn’t been a completely solved problem already. So let me go back to 2018, when Meltano was founded, and cover a little bit of the time and the changes that have gone on in Meltano during that time. When Meltano was founded in 2018, we had this hope of building an end-to-end platform that could do everything from data integration to helping you build the dashboards, end to end, “from data to dashboards” we called it at the time. And we were on the one hand looking at great open source technologies that we could leverage, and we were also willing to build our own new stuff that would really work well with this software development way of thinking.
From day one, this was going to be open source, we were going to build it with the community, and we really wanted them involved, not just from a feedback perspective, but also actively helping us make this a reality. But we came to realize over the course of 2018 and 2019, and I joined at the end of 2019, that this end-to-end vision was too heavy, in a way, for people to adopt and start using and start contributing to, because we kind of assumed that you would replace your entire data stack with this Meltano thing, which meant that we had a lot of ground to cover until we could actually plausibly replace whatever best-in-class tools companies had picked so far. So by the end of 2019, when I joined, we were working on making the end-to-end thing work. You could bring plugins into Meltano for a particular data source like a Stripe or a Shopify or what have you. We were kind of focusing on the business-to-business, or B2C rather, and e-commerce field, just to have a use case in mind to focus on. And we had built something where you could bring plugins in for these sources, and you could indeed, with kind of one click, go from entering your credentials for Stripe or Shopify or one of a number of tools we supported to having a dashboard show up at the end. But we were getting some interest from really early startup founders who didn’t have the resources to build a data team and set up their own stack, and we were not actually getting the interest from the data engineering or data analytics community that we were looking for. So in early 2020, from GitLab’s perspective, the decision was made that the numbers we were seeing in terms of traction and usage did not warrant the continued, kind of full-time staff of six people on the team at the time, which was a general manager, Danielle Morrill, myself as engineering lead, and then four engineers. One of them, we found out earlier, is actually a friend of Kostas, Yannis Roussos. He’s really awesome.
But we realized that six people on a product that was kind of flat in terms of growth was just not going to work. So the decision was made to reduce the headcount down to one, to essentially extend the runway sixfold. And I was left by myself on the product, essentially, to figure out how I could turn Meltano around. So over the first few months, that was, of course, super daunting, because I was essentially the newest to data out of the entire team. My background is in software engineering. And I realized that I was kind of blind to the needs of data professionals themselves. And I was very aware of the saying that when all you have is a hammer, everything looks like a nail. Am I just seeing things that aren’t there? Is the big problem in the data world really about bringing more developer-style tooling in and making open source data stacks more of a compelling alternative? So I started talking to a lot of the data people that had become Meltano fans and followers over the years, not users, not contributors in many cases, but at least people who were willing to talk to us about what they liked and what resonated originally. And I found out that, sort of accidentally, in Meltano, by identifying these great open source technologies for different stages of the lifecycle, we had found Singer as the standard for open source data connectors, which was built by Stitch, which we talked about a second ago, and which has this ecosystem of, at this point, more than 300 connectors for different sources and destinations. And what I was hearing from these users was, like, well, you know, you’re building your own open source BI, but there’s already a bunch of solutions for that. You’re embracing dbt for transformation, and that is great, but, you know, dbt is great standalone. But this Singer thing could really benefit from better tooling around running these pipelines, deploying them, configuring them, and building new connectors for data sources.
So I realized that it was not necessarily by changing the product, but by changing the positioning to focus exclusively on open source ELT, and saying, look, this is the best way to run Singer- and dbt-powered pipelines on your own infrastructure, on your own machine, and you get all of these DevOps and data ops advantages for free because your pipelines are managed in a YAML file, and, you know, you get testing and all of this stuff. Over the course of 2020, just from the simple act of changing the way the website talked about Meltano, we suddenly started picking up tons and tons of usage as an ELT tool, even though, from our perspective, Meltano had always been an end-to-end platform with best-in-class technologies to build integrations with that can run on top of the platform. So by the end of 2020, we had really kind of created the change in the Singer ecosystem that we and the community agreed was needed. There was always this weird situation where Stitch itself is a proprietary SaaS data integration platform, but the connectors that run on it are in many cases open source and available for free, and you can just download them. But those connectors by themselves don’t give you all of the ETL functionality you need to actually run this stuff in production. And that is where we stepped in, to the point where in early 2021, earlier this year, I got the permission from GitLab to start bringing some more people in, and we started talking about setting Meltano up for the best success in the market and really becoming the tool that makes data ops a reality for the data lifecycle and data teams as a whole. And we realized that GitLab was a 1,400-person company where literally, you know, 1,399 people were working on this big thing called GitLab, and marketing for GitLab and sales for GitLab and everything GitLab. And I was by myself working on this tiny little other thing.
And we realized that some of the stuff you need as a startup, to be able to move fast and make compelling offers to great candidates, GitLab was just not set up to do anymore, because the realities and the needs were so different. So we realized that in order for Meltano not to be slowed down by the inevitable increasing bureaucracy that kind of comes up in GitLab, our best path forward was to spin out. So over the course of 2020, as we were gaining traction, I had already had literally dozens of VC firms reach out, saying, like, let’s talk about this eventually, Meltano isn’t always going to be internal, it’s going to be its own thing one day. So in early 2021, earlier this year, we started concretely talking to some of these potential VCs. And that led to us raising a seed funding round from GV, formerly known as Google Ventures. And that led to my transition: literally, in January I was the general manager of a product by myself. In February or March, I hired two people while we were still in GitLab, and we were three. And then three months later, I was founder and CEO of a startup that really quickly built a team of about eight, nine, ten people. And six months earlier, I had just been by myself, so that was amazing. But as you can imagine, it was also a whole new challenge and opportunity for myself to be pushed to my limits and then have to overcome them, which, of course, is extremely rewarding.

Kostas Pardalis 20:42
Yeah, that’s amazing. I think we should spend some time later to have you share with us a little bit of what this transition felt like, because, to be honest, you have had quite an amazing journey so far, as you said, from being a teenager building apps, to joining GitLab very early on, to being a manager for engineers, and now CEO. I think there’s a lot of wisdom to share there, even, for us, about the emotional side of things, right? Like how the emotions change. But let’s do that a little bit later, because I want to ask you about Singer. Singer is a very interesting case of an open source project, especially in the data space, because I had the opportunity to, let’s say, experience the war between Fivetran and Stitch Data as it was happening, because I was also competing with them. And it was very interesting how these companies were positioned, and how Singer came into the game to support the positioning that Stitch Data had. But Stitch Data left the game a little bit early. They launched this thing, it got traction, then they got acquired by Talend. And then we were left with Singer out there, where people kept using it. And we are at the moment today, right, after all these years, where we have Meltano, which is building tooling around it, and we have Airbyte, which is pretty much based on the Singer protocol. And I’m pretty sure we will see more stuff happening around it. So I’d like to ask you, first of all, what was Singer like when you first started working with it, and what was missing from it? What was it that Stitch Data didn’t do about Singer?

Douwe Maan 22:29
Yeah, great question. So when I really started digging into the data space, and Meltano, and the tools we had adopted, in 2019, Singer had already been the standard for data connectors that we had adopted, because the library at the time was, I think, somewhere in the 100 to 200 range of connectors that were supported, and there was a community of a few thousand people around it. And there seemed to be, at least on the more popular connectors in the ecosystem, frequent enough updates that they would be production ready. But from talking to people, what we realized is that connectors for sources and destinations, just these tiny little executables that you can run in your terminal and pipe together to have data flow from A to B, are not enough to actually replace, you know, an entire ETL solution. And that’s, of course, also why Stitch itself, the hosted platform for running the Singer connectors, is paid, because a lot of the value is not just in the connectors themselves, but in the tooling that manages incremental replication, that manages backfills, that manages all kinds of aspects of the real production-level reliability of these pipelines that go beyond just running the code. And Meltano had already built that. The other thing that we saw is that people found it too difficult to build new connectors and to improve existing ones. There existed this Singer Python library that had a number of helper functions, and most of the connectors were built around this library. But there was a lot of decision making left to the engineer as to how exactly to use these, how to deal with incremental replication state, and how to deal with selection of specific streams and columns, which are roughly analogous to database tables and table columns. So we realized there was also an opportunity for better tooling around building these connectors. And then finally, the big problem was discoverability.
Singer.io, the official website for Singer, has a list of about 99 connectors. But in most cases, those link to the connectors in the singer-io namespace on GitHub, where a lot of these repos are housed. And as we’ve been talking about, Stitch, unfortunately, I think because of the Talend acquisition, sort of lost the motivation to really actively maintain these projects. So a lot of these repositories ended up with, I mean, dozens of unanswered open issues and pull requests, and bugs that had been known for ages but just had not been fixed. Even if a fix had been provided by the community, the plugin you would have downloaded would still have had the bug. So there are two issues there in discoverability, one of them being that in many cases the singer-io repositories actually had forks that were more actively maintained, and those are really the ones you should be using if you want to have the highest quality and everything. And the other part was that Singer.io only listed the connectors that Singer at one point had adopted into its own GitHub namespace. There existed hundreds of connectors in other companies’, consulting firms’, and other data products’ own GitHub repositories that were also available for free, and in many cases better maintained, but were not discoverable at all unless you knew how to do a special search on GitHub. So we identified these three issues: building these pipelines and running them in production, building connectors, and then discovering connectors. So we just set out, essentially, to address them one by one, to lift up the Singer ecosystem and empower it, not necessarily to own it and make it our own, but to give it all the tools it needs to be able to stand on its own and keep growing, even without our kind of continued heavy-handed involvement.
So Meltano itself became this runner that makes it really easy to run, configure, and deploy, and we built the Meltano SDK for Singer taps and targets, which makes it easier than ever to build new connectors. The code footprint of an existing connector that is ported to the SDK is reduced by about 90%. And people have told us that getting a new connector up and running with all of the Singer bells and whistles, like incremental replication and stream and column selection, only takes as much as two hours because of some of these abstractions that we have built around REST APIs, GraphQL APIs, and other custom methods. And then finally, we built MeltanoHub for Singer taps and targets to catalog all of the different taps and targets in the ecosystem, and it turns out there are more than 300 sources and destinations that have Singer connectors for them. And about half of those have been updated in the last year. And the other ones are not necessarily outdated; those might just be APIs that don’t require quite as frequent updates. So the Singer ecosystem is in a really great place now compared to how we found it as Meltano about a year and a half ago. And we have recently also set up the Singer Working Group, which has us in it along with a number of big players in the Singer ecosystem, including the Stitch team at Talend, who were the original creators of the spec, other tools that use Singer to power their connectors, like hotglue and Y42, and there’s a few others, as well as some of these consulting firms that have built a lot of these connectors over the years for their clients that needed sources that were not supported by some of the tools like Fivetran. So Singer is now at a place where it can, in combination with Meltano and other tools we’ve built, rival Fivetran and a lot of these other tools, especially on the size of the connector library.
And it has the advantage of being open source, which means that you are never limited by anyone else if you want to improve or extend or customize these connectors, or if you want to build a new one for a new source. And interestingly, having put Singer in such a place has actually given Meltano the opportunity to look at what we’re doing, what our mission is, and what our goal is, and to take a step back from this really narrow focus on EL, which was kind of a strategic decision we took in early 2020, as I was describing, and to focus again on bringing data ops to the entire data lifecycle by building Meltano into a data ops operating system that can form the foundation of every team’s ideal data stack, by allowing best-in-class open source components for the various stages of the data lifecycle to be brought on top of the OS, with the OS taking care of the consistent installation, configuration, and deployment, and the integration between the various tools. And I can talk a ton about that, because it’s kind of where we’re going, but it’s good to stand still a little bit on Singer and what it was and what it is today and what we’ve been doing.
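
To make the tap-and-target idea Douwe describes a little more concrete, here is a minimal sketch in Python of two toy connectors exchanging Singer-style messages (SCHEMA, RECORD, and STATE as JSON lines) with a bookmark for incremental replication. It does not use the Singer Python library or the Meltano SDK. The stream name `items`, the sample rows, the `bookmarks` state layout, and the function names `run_tap`, `run_target`, and `run_pipeline` are all assumptions made up for illustration, and an in-memory buffer stands in for the shell pipe that would normally connect two separate executables.

```python
import io
import json

# Hypothetical in-memory "source", already sorted by updated_at.
SOURCE_ROWS = [
    {"id": 1, "name": "alpha", "updated_at": "2022-01-01T00:00:00Z"},
    {"id": 2, "name": "beta", "updated_at": "2022-01-02T00:00:00Z"},
    {"id": 3, "name": "gamma", "updated_at": "2022-01-03T00:00:00Z"},
]


def run_tap(state, out):
    """Toy tap: write SCHEMA, RECORD, and STATE messages as JSON lines."""
    # Resume from the bookmark left by a previous run (empty string = full sync).
    bookmark = state.get("bookmarks", {}).get("items", {}).get("updated_at", "")
    schema_message = {
        "type": "SCHEMA",
        "stream": "items",
        "key_properties": ["id"],
        "schema": {
            "type": "object",
            "properties": {
                "id": {"type": "integer"},
                "name": {"type": "string"},
                "updated_at": {"type": "string", "format": "date-time"},
            },
        },
    }
    out.write(json.dumps(schema_message) + "\n")
    for row in SOURCE_ROWS:
        # Same-format ISO 8601 strings compare correctly as plain strings.
        if row["updated_at"] > bookmark:
            record = {"type": "RECORD", "stream": "items", "record": row}
            out.write(json.dumps(record) + "\n")
            bookmark = row["updated_at"]
    new_state = {"bookmarks": {"items": {"updated_at": bookmark}}}
    out.write(json.dumps({"type": "STATE", "value": new_state}) + "\n")


def run_target(inp):
    """Toy target: collect records and remember the last STATE value seen."""
    rows, last_state = [], {}
    for line in inp:
        if not line.strip():
            continue
        message = json.loads(line)
        if message["type"] == "RECORD":
            rows.append(message["record"])
        elif message["type"] == "STATE":
            last_state = message["value"]
    return rows, last_state


def run_pipeline(state):
    """Connect tap to target through a buffer, standing in for a shell pipe."""
    buffer = io.StringIO()
    run_tap(state, buffer)
    buffer.seek(0)
    return run_target(buffer)


if __name__ == "__main__":
    rows, state = run_pipeline({})      # first run: full sync
    print(len(rows), state)
    rows, state = run_pipeline(state)   # second run: nothing new to load
    print(len(rows), state)
```

The part this toy leaves out is the part the conversation is about: something has to store the state between runs, pass it back to the tap, schedule the runs, and handle configuration and deployment. That surrounding work is what a runner takes on.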

Kostas Pardalis 28:45
Yeah, I have a few questions about the future. But I’m sorry, I’m a little bit curious about the evolution of Singer, right? Because from what I hear from you, we had Singer, and the ergonomics of it, at least in the last years, were not the best. So you created the Meltano SDK on top of that. Is that an extension of Singer, or how does this work?

Douwe Maan 29:08
To be clear, we have not extended Singer in any way so far. We are working with the Singer Working Group on Singer extensions, because we want to make sure that those are supported and approved by all of the different players in the Singer ecosystem. We think a big part of its power is the fact that it is no longer purely connected to one particular product or company, like it used to be when it was just the connector framework for Stitch. And similarly, we’re seeing other open source data integration vendors, like the one mentioned before, coming up and building their own connector standards on top of Singer with private extensions. But we believe that Singer is kind of special in that it is agnostic and really community led, and everyone in the ecosystem, the different consulting firms and different tools, can adopt it because it is the de facto open source standard without any particular company owning it today, which is a strength.

Kostas Pardalis 29:58
Okay, perfect, perfect. The reason that I’m asking is because I’m quite aware of how the Airbyte version of Singer works, which is built on top of Singer. It’s not Singer exactly, right? They have made some very smart decisions in terms of how the interfaces work, with standard input and output between Docker images and stuff like that, which give a lot of, let’s say, interoperability between different languages and frameworks. But it’s something different, you know. It’s not exactly Singer. I mean, there are elements of Singer, but I don’t think they maintain backwards compatibility with Singer by now. It’s something different at the end, right? So that’s why I was asking whether what Meltano is doing is something similar, or whether you’re focusing on maintaining and reviving Singer.

Douwe Maan 30:49
Yeah, the interesting thing is that Singer as a standard is really great. Stitch came up with it, and it served their needs for a long time. But it also hasn’t evolved a lot over the years, since they have sort of lost interest. So there are definitely a lot of areas in which it can be improved. But at the same time, a lot of the issues with current or existing Singer connectors were not actually because of limitations in what Singer can do, but just because a lot of these connectors were not even making the most of what Singer can already do today. So we wanted to first address that by making it really easy with the new SDK to start using everything that Singer can already do today, to reach the full potential that was already there, before starting to look ahead and see, okay, how can we make Singer better. So the first important thing for us was to increase the consistency of behavior across different connectors in the ecosystem, especially for newly written ones. And the SDK has delivered on that. It makes it so that you can opt into some of these Singer capabilities without having to completely figure out yourself how to implement them, and it automatically leads to more consistent behavior across the board. But now that people can actually make the most of Singer through both Meltano and the SDK, we are starting to work on improvements to the spec. Airbyte was in this, you know, in their case great position where they could just say, okay, we don’t need backward compatibility, we’re just going to call it the Airbyte spec, take a lot of inspiration from Singer, and then fix everything we think is broken and improve it. And they could do so unilaterally.
But we think that there is so much potential in the Singer ecosystem and the existing community of literally hundreds or thousands of consulting firms, data engineering teams, and data product developers that we didn’t want to just let it go. Because then you get into the position of that famous XKCD comic that says, there are 12 standards, they all suck, you know, I’m going to make a new standard. And then the next frame says, now there are 13 standards. So we decided that the only way to really make Singer better is to first increase people’s confidence and trust and belief that this is going somewhere. And through the things we’ve brought to the Singer ecosystem, we have definitely kind of revived that enthusiasm. And then the next thing was to get all of the big players invested in Singer together in a room to start working on those next iterations of Singer together. The first priorities that the Singer Working Group is talking about are to address some of the same concerns that Airbyte has already been able to address because they did so on their own. But we are starting to do this through a more standardized process where we get everyone involved around the table and also bought into supporting this in their connectors going forward and implementing it in their tools. So that has to do with things that improve performance or throughput, and it has to do with the automatic discoverability of a connector’s configuration and features, for example, which is now something that lives separately in the repo from the actual connector. And there’s a whole list of other things that you can find if you Google “Singer Working Group” and find the repo where we’re working with these players.
And we were actually really, really grateful to see that the Stitch team at Talend was just as excited as us about this opportunity to keep growing and improving this for the benefit of the entire data community. And that ties back to the importance of Singer being seen as something separate and agnostic, something that will always survive as long as enough people use it, rather than something whose fate is tied to one particular product. In part that is because, from Meltano’s perspective, we don’t want to take over ownership of Singer forever, because we are building a data operating system; we’re not just building an ETL tool. So it’s in our interest for there to be independently thriving open source technologies for every step of the data lifecycle that we can make better than the sum of their parts. But ultimately, it has to be this ecosystem and community around Singer that keeps it alive. We are happy to have a big role in that and put development resources and everything towards it, but we cannot do it ourselves.

Kostas Pardalis 34:37
Mm hmm. I have a question that I think is also going to lead us into the future of Meltano and data ops. I want to ask you about how you, as Meltano, can manage the quality of these connectors. I think this is one of the biggest, let’s say, arguments that the closed-source vendors like Fivetran have. Yeah, sure, I mean, you can go to GitHub and download something. And of course, many of these versions of the connectors are just crap, right? They are not updated, they are simply not made well, all these things. So how do you deal with that, with such a diverse, let’s say, code base?

Douwe Maan 35:18
Yeah, it’s a really interesting question. And it kind of comes down to the trade-off in the decentralized maintenance of an open source ecosystem. You get a ton of advantages: there’s not a single bottleneck who can slow things down, and the number of connectors is essentially endless if you decentralize the maintenance to different invested parties. But that also means that we cannot unilaterally fix a bug ourselves in some particular connector if we want to, because we do not necessarily have ownership over that repository. The way we’re thinking about it is that in any open source ecosystem, if there are enough users who are okay with this deal of, okay, I get to use it, but I maybe occasionally have to fix stuff, then the most used connectors will automatically get enough usage and eyeballs that they are in a good state. And for us, it’s more important to have a decentralized ecosystem that can scale indefinitely than to have a smaller ecosystem that we have tighter control over. But that does mean that if you are a company that just needs connectors that will always work, and you never want to worry about maybe fixing a bug yourself, Meltano, or rather Singer, might not be the best choice for you today. But the more companies become involved that do this work, the higher the quality that even companies that aren’t willing to put in their own contributions can expect to find in the ecosystem. And I do want to stress that the quality of connectors in the ecosystem is already higher than a lot of people might have thought a year or so ago, because the best variants of a lot of these connectors are in forked repositories rather than the initial singer-io one that you will find, and a lot of them are seeing maintenance.
So in part to address this maintenance question, we have also set up MeltanoLabs, which is a way of pooling decentralized maintenance, so that people don’t have to take on the maintenance burden indefinitely. Instead they can say, for a period of time we are heavily using this one, we are improving it for our clients, so we are okay with taking on the maintenance hat for the next few months or so. But then it stays within the MeltanoLabs pool, where we have some control over it but we are not a bottleneck per se. The flip side of this, though, is that in an open source ecosystem, web applications used every day, including, you know, RudderStack and Meltano, but also massive ones like Reddit and Facebook and whatever, are all built on open source technology that in many cases is also just managed by individual contributors. And you have the same question there, the same trade-off: can we expect that quality to always be there? But we all know that there are high-quality, maintained API client libraries for all of the big APIs for all of the big programming languages. You can find, you know, Shopify API clients in every programming language. In many cases, these are even built by the vendors themselves, or they’re maintained by an active community of maintainers. And if we trust these API client libraries enough to use them in production software, then in the limit, there is no reason not to trust an ecosystem of connectors at a similar level. But from our perspective as a data ops OS, we don’t really care which particular technology you bring into Meltano, whether that is Singer or dbt, or even, you know, Airbyte or RudderStack. We have plans to support all of these in the future, because we think that it’s up to us to provide teams the choice to put together their ideal stack where they can make the trade-offs they need.
And we will build the data ops OS that ties it all together and allows them to treat their entire data stack as a product, in the way of the software product development lifecycle, rather than just a set of disparate tooling and purchasing decisions. So Singer is not going to be for everyone, maybe not ever. But that’s okay, because there are lots and lots of organizations that do like the trade-off of, I can fix it and improve it and customize it without needing to ask someone for permission, and I’m okay spending a few engineering hours per month to do so, just as is the case today with other open source projects.

Kostas Pardalis 39:07
Well, it’s a huge conversation. I mean, we could probably do multiple episodes just chatting about how you can structure this kind of open source project. For me, it’s very, very interesting, and I think there’s a lot of value in there. But let’s keep that for another episode. I’ll be more than happy to dedicate one just to this. And let’s get into the data ops side of Meltano. So you mentioned at some point that Meltano started as an end-to-end platform, okay, and it has transitioned now, or is transitioning, into a data ops OS. What’s the difference between the two?

Douwe Maan 39:46
Yeah, good question. So when you’re looking at the previous generation of data tools, what you primarily saw is these big products that kind of do it all. They do everything from, you know, the integration to the analytics, and this is potentially a consequence of these tools having started with a less technical analytics audience, with a BI tool, and then working backwards into the rest of the stack. And they do it all, but they do it all from a UI-based, SaaS, web browser perspective. And the tools you’ll find today that call themselves data ops platforms are also these types of tools that try to do everything really well while bringing in some of these data ops qualities and software development best practices. But the data space of today is uniquely horizontally integrated, in the sense that for every step in the data lifecycle and every layer in the stack, you have a number of competing solutions, with new ones coming up every day, being funded by VCs, and going through accelerator programs like Y Combinator. So it’s not realistic anymore for any data team to find one tool that does it all that they will actually be happy with in the long run, because you’re going to be missing out on a lot of these new improvements. But the data space has turned from one big application with full visibility and control of every aspect of the data stack into a world where you have tools with a really narrow focus that need to be individually integrated, in many cases manually by data teams. What has gone missing is this sense of a unified unit called the data stack that can be reasoned about as a whole, that can be version controlled as a whole, that can be end-to-end tested, and that can be experimented with and played around with without worrying that there’s some SaaS thing running somewhere that doesn’t have this concept of an isolated environment.
So the way we’re seeing the world now is that there’s a really big opportunity for a new foundation, a new layer in the data stack, that we are calling the data ops operating system, which forms the foundation of every team’s ideal data stack. That’s how we’ve described our vision. What that means is that you have these best-in-class open source components, like a Singer or an Airbyte for EL, a dbt for transformation, RudderStack or similar tools for reverse ETL, Superset, Metabase, etc. for BI and analytics. And of course, you also have all of these data science tools like Jupyter that can be brought in and that are also part of the data stack. We want all of this stuff to live together and be defined in a single repository in a declarative way, so that a team can reason about their data stack again as one unit and get these advantages I was describing. So compared to data ops platforms from the past, the big difference in Meltano is that we are modular from first principles and in architecture, and we want to earn a new place in the data stack instead of trying to replace something existing. And we call ourselves a data ops OS because what we care about a lot is merging these worlds of software development and data engineering, or at least allowing them to cross-pollinate and learn from each other more. We think that a lot of the work that we currently call data engineering is really data stack development. And it’s far closer to software development, where you’re also picking, you know, off-the-shelf components, custom components, or some open source technology, and you might be using some SaaS that you have to connect with over an API. We are trying to allow data teams to start treating their work more like software development and get those same advantages. And our path has sort of been, you know, paved for us a little bit by dbt already making analysts more comfortable with some of these concepts.
And we are trying to go all the way and bring data ops not just to EL, as Meltano has been doing over the last year or two, or to T, as dbt is doing, but to the entire data stack. And we think data stacks can be better than the sum of their parts if you bring in Meltano to help, you know, manage it all and help the integration between the different components.

Kostas Pardalis 43:37
Oh, that’s great. I have one last question, because I’m starting to feel really bad that I’m monopolizing the conversation here.

Douwe Maan 43:45
Oh, you’re not. I’m pretty sure I’m talking way more than you are. But yeah, your colleague should talk too.

Kostas Pardalis 43:50
Yeah, exactly. I’ll wear my engineering hat and ask one more question about data ops. So what’s the difference between the data ops operating system and something like Airflow?

Douwe Maan 44:04
Yeah, that’s a great question. So one big difference is that in your data stack, data movement is kind of the domain of Airflow and similar workflow orchestrators like, you know, Dagster or Prefect. And those workflow orchestrators of course reach out to different tools that handle parts of that workload. But there’s more to the data stack than that. You have a BI tool at some point, and you might have tools that don’t really fit within the Airflow way of working. And if you’re using Airflow, you still have to install it somewhere, deploy it somewhere, and manage the version control of your orchestration. Similarly, if you’re using a BI tool, you still have to install it somewhere, manage your dashboards, and version control those. So Meltano is essentially the package manager for your entire data stack that all of these things can be brought into, even things that are completely out of scope for Airflow, which only cares about data movement. So Meltano allows any tool your data team uses, whether it’s used by the analyst, the analytics engineer, or the engineer, and whether it’s about the movement or the consumption at the end, to form part of a greater product, where in some sense the end users are your colleagues within the company, the interface or the features are some of those consumption methods and dashboards, and the backend, so to speak, is more where Airflow lives. But the front end and the whole product are what Meltano brings together by forming a package manager for every tool in the data stack. From an engineering perspective, you can also see it as a Terraform for data stacks, because we allow people to really easily bring in tools, declaratively or with the CLI, and Meltano manages the configuration and deployment and all of that stuff.
That way, an engineer who wants to put together a data stack doesn’t have to pick six tools, learn how to install them, learn how to configure them, and then be the only person on the team who really knows how it all works. We also want to sort of democratize that and give the stack a single source of truth that the entire team feels comfortable collaborating in, and also make trying out new tools and swapping out tools really easy, by giving them the confidence that if Meltano has support for your tool, adding it and trying it locally or wherever is going to take just a few minutes of work, instead of this daunting task of figuring out how am I going to integrate it: maybe this one is Docker, maybe this one is PyPI, and maybe this one is npm. We want to unify all of that.
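
To make the “package manager for the data stack” idea a bit more concrete, here is a rough sketch of a declarative plugin list that is turned into install commands. This is not how Meltano is implemented, and it is not its project file format. The `Plugin` class, the `install_plan` function, and every plugin and package name below are hypothetical; only the underlying `pip install`, `docker pull`, and `npm install` commands are real. The sketch builds the commands without running them.

```python
from dataclasses import dataclass

# One command template per packaging ecosystem the stack might draw from.
INSTALL_COMMANDS = {
    "pip": ["pip", "install", "{package}"],
    "docker": ["docker", "pull", "{package}"],
    "npm": ["npm", "install", "{package}"],
}


@dataclass(frozen=True)
class Plugin:
    name: str       # how the team refers to the tool
    role: str       # e.g. extractor, transformer, orchestrator, dashboard
    installer: str  # key into INSTALL_COMMANDS
    package: str    # hypothetical package or image identifier


def install_plan(plugins):
    """Translate a declarative plugin list into concrete install commands."""
    plan = []
    for plugin in plugins:
        template = INSTALL_COMMANDS.get(plugin.installer)
        if template is None:
            raise ValueError(f"no installer known for {plugin.name!r}")
        plan.append([part.format(package=plugin.package) for part in template])
    return plan


# The whole stack lives in one version-controlled definition (names made up).
STACK = [
    Plugin("tap-example", "extractor", "pip", "tap-example"),
    Plugin("example-bi", "dashboard", "docker", "example/bi:latest"),
    Plugin("example-cli", "utility", "npm", "example-cli"),
]

for command in install_plan(STACK):
    print(" ".join(command))
```

The design point is that the team edits only the `STACK` definition, and one layer knows how each kind of tool gets installed, so swapping a tool is a one-line change rather than a new integration project.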

Kostas Pardalis 46:19
Yeah, that’s great. Eric, all yours. I have to apologize, by the way to both of you. Because I just realized that based on the outline of the conversations that we have created before we started the recording, like the stuff that I asked was different. So

Eric Dodds 46:35
that’s great. That was awesome. I learned a ton.

Douwe Maan 46:38
Like I said, it would be an organic conversation, right? Wherever it goes.

Eric Dodds 46:42
Yeah, I know we’re close to time here, but a couple of quick questions. So one is, how much of what you just talked about exists today? I know that’s sort of part vision, and this is where Meltano is going. I mean, how much of that can you actually use today?

Douwe Maan 46:59
That’s the perfect question. So architecturally, even during the year or so that Meltano was talked about and perceived as an ELT tool, Meltano always had this plugin-based architecture that allows different open source tools to be brought in. So from a software perspective, we’re essentially already there. The only thing we’re still lacking is the specific plugins we support. So far, we have invested really heavily in support for Singer taps and targets for EL, dbt for transformation, and Airflow for orchestration. The biggest challenge for us now is to keep building out the breadth of types of plugins we support and, of course, the level to which we support each individual plugin. So in the very near roadmap, we will be investing a lot in the dbt integration we already have, to make it as good as it possibly can be. And at the same time, we are investing in bringing more parts of the data stack and lifecycle into Meltano. So very soon, we are going to release support for Great Expectations within Meltano. We are looking at Superset and Lightdash, some of these BI and analytics tools that you can bring into your Meltano project and manage and configure consistently with everything else. And similarly, we are looking at open source reverse ETL solutions like RudderStack, like Grouparoo, and a number of others. And even on the EL side, just to show to the world that we are not just here to push Singer or to push dbt, we plan to support Fivetran through an API connection. And even Airbyte is in scope for us, even though in how people previously thought about Meltano, it would have looked like a direct competitor. But from day one, we have been building an end-to-end platform to make data ops a reality. Originally, we thought we could do so by just building one platform that does it all. We’ve come to realize that it has to be plugin based.
And in that new world, we leave it completely up to data teams what tools they want to use on top of Meltano. We just want to make sure we support all the current popular and best-in-class tools and make sure that data ops is possible with them: version control and all of this stuff. And we don’t really care to be a kingmaker for one particular technology. So over the coming months, especially Q1 of the coming year, we will be building out this broader and deeper plugin support, as well as data ops specific functionality like isolated environments, end-to-end testing, and a lot of these things that software developers have already been using. We just have to figure out how to make them work with data and data tools, and how to explain them in ways that will resonate with data professionals. So this is all going to pan out over the next three months or so. But you know, we have a Slack community of more than 2,000 people right now who are with us on this journey and are giving us feedback and contributions every day to move along this path. So I would like to suggest to the people joining us: of course, keep an eye on the features we release over the coming months. But if you want to be part of this conversation, and you want to shape the data tooling of the future and be part of this wave that’s going to make data teams as effective and productive as software development teams have become over the last 10 years through the introduction of DevOps, then Meltano’s Slack community is the place to be. And just a very quick pitch as well: we are also hiring in both engineering and marketing. So if you go to meltano.com/jobs, you can look at ways to help us out. We are all remote, so we’re hiring across the world, and we pay really competitively everywhere. So check us out.

Eric Dodds 50:10
Awesome. Well, Douwe, this has been such a fun episode. We really appreciate you sharing some of the backstory. It’s an incredible story, going in six months from, you know, being the lone product manager for an internal product to raising a round and becoming CEO. So congratulations, incredible journey. And we’re excited to see where you take it.

Douwe Maan 50:32
Thank you so much, Eric. Yeah, I think there’s some stuff we could keep talking about, like Kostas already mentioned. So I think we’ll have to come back, maybe Q1 of next year, when we have made some more progress on that data ops vision and can talk about how that’s panning out. And we can also spend some more time talking about, yeah, the transition from an engineering manager inside GitLab to a CEO. That’s definitely been an opportunity for me to run into my own limitations and past assumptions that don’t hold anymore. We could easily fill an hour just on that topic alone.

Eric Dodds 50:58
Great. We’ll definitely do it.

Douwe Maan 51:00
Thank you so much.

Eric Dodds 51:01
Thank you. Douwe is such a unique individual in that he has a depth of knowledge across such a wide variety of subject matter. And I think that’s certainly been accelerated by him taking on the role of CEO of Meltano. This is my takeaway from the show. There’s the old adage, I think from the Netscape fundraising story, that, you know, you’re successful in two ways: you bundle or you unbundle. And I’ve been thinking about that a lot lately in the data tooling space, because there are companies actively trying to bundle and actively trying to unbundle, in general across tooling but then also within specific disciplines. And thinking about Meltano as sort of the package manager for the entire data stack is a really fascinating way to bundle, and I think it opens up a lot of opportunity for them that a lot of other companies aren’t going to have, because they don’t necessarily have to make choices about specific tooling. And so I know I’m going to be thinking about that all week, because, you know, it’s a very unique approach to bundling, or I guess bundling is, you know, an interesting way to describe what they’re doing. So how about you, Kostas?

Kostas Pardalis 52:23
Yeah, 100%, I agree with you. It’s very interesting to see platforms like these emerging. And at the same time, we have a team behind it that, you know, has the best possible pedigree to succeed in this, because they are coming from GitLab, right, where that’s exactly what they were doing: building this kind of tool, but for software engineering. So I’m very excited to see how they are going to move forward. Hopefully, we will have him on another show pretty soon, because things are changing really fast. But I would also like to add that if they succeed in what they’re doing, I think they are also going to act as a great accelerator for the open source projects, which is very interesting, because we have open source projects with varying degrees of maturity, let’s say, especially when it comes to the ETL part with all the connectors and all that stuff. So with something like Meltano in place, and also the governance that Meltano brings with all the initiatives around open source, I think we are going to see these communities actually maturing much, much faster. Which is nice, because I am a person who has experienced, let’s say, the birth of Singer, and then it got into some kind of winter situation where it was existing but not existing, barely maintained. And today, seeing all this effort, with Meltano being the leader to revive the project and govern the project in a way that’s going to be valuable, is super, super interesting. It’s very fascinating. And I’m really interested to see what’s going to happen in the next couple of months.

Eric Dodds 54:10
And we’ll definitely have to have Douwe back on the show, because we’ve barely scratched the surface on several subjects. So thanks for joining us again on The Data Stack Show. We have lots of great stuff coming up, so make sure to subscribe, and we’ll catch you on the next one. We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.