Episode 64:

Data Stack Composability and Commoditization with Michel Tricot of Airbyte

December 1, 2021

This week on The Data Stack Show, Eric and Kostas talked with Michel Tricot, Co-Founder of Airbyte. Michel and team are building the new open-source ELT standard for replicating data from applications, APIs & databases. During the episode, the trio discuss Michel’s career arc, what size of company requires more infrastructure, and why Airbyte is open-source.


Notes:


Highlights from this week’s conversation include:

  • Announcement: Data Stack Live! (1:00)
  • Michel’s career background (4:13)
  • Solving the technical and process challenges of moving data (7:04)
  • Lessons learned from managing data at LiveRamp (9:35)
  • How to build a modern data stack (16:19)
  • Triggers to signal when more data infrastructure is needed (23:19)
  • Why Airbyte is an open-source product (30:23)
  • Airbyte’s role in providing support to open-source problems (38:15)
  • How important dbt is for the Airbyte protocol and platform (41:03)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Automated Transcription – May contain errors

Eric Dodds 00:06
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. Welcome to The Data Stack Show. Today, we’re talking with Michel Tricot. He is one of the founders of Airbyte, and Airbyte moves data for companies. Really interesting company; they’ve grown a ton. I think they’ve been around for a year. But he has a pretty long history in data. And this isn’t going to surprise you, Kostas: he worked at a company called LiveRamp, which anyone who knows marketing knows. LiveRamp does a ton of marketing and audience data. And so of course, I have to ask him about that experience. He was there pretty early, I believe. And so I want to hear what it’s like to talk to a data sort of engineer, data integrations leader at a marketing data company like LiveRamp. So I’m going to try and sneak that in if I can. Well,

Kostas Pardalis 01:15
first of all, I love French accents. We have to get more French people on the show.

Eric Dodds 01:21
It really is. Yeah, Alan, I remember our first French guest, Alex, and I just loved hearing him talk about data. It was great.

Kostas Pardalis 01:30
Yeah, so it was very interesting to hear. I can speak French anyway. So what I’m going to ask him... I mean, for me, it’s a very special episode, because we are talking about the person who’s building a data pipelines company, right? So there are many different things that I’d like to ask him. But I think the most important and the most interesting part is the open source dimension: how building a community is part of the product, and how this can actually become some kind of, let’s say, moat for the company. And it’s very interesting in this case, because you have to understand that Airbyte came at a time where, let’s say, the market of data pipelining was supposed to be done, right? We had Fivetran. Fivetran is probably the biggest vendor right now. Suddenly, you have Airbyte coming, doing something different, and this has an impact. So I think it’s going to be very interesting to talk with him both from a technical perspective and from a business perspective.

Eric Dodds 02:33
I agree. All right. Well, let’s jump in and talk with Michel. Good. Michel, welcome to The Data Stack Show. We’re really excited to talk with you today.

Michel Tricot 02:43
Hey, Eric, thank you so much for having me. All right. Well, you’ve

Eric Dodds 02:46
been working in data for a really long time. Can you just give us an overview of kind of where you started, what you’ve done, and then what you’re working on today at Airbyte?

Michel Tricot 02:57
Yeah, sure. So I’ve been in the data space for the past 15 years. I started my career in financial data, so, say, medium volume of a few hundred gigabytes. And then in 2011, I moved to the US and started at this, back in the day, small company called LiveRamp, and was able to experience the hypergrowth from finding product-market fit, to getting it, to getting to an IPO, to getting an acquisition. And I was head of integration and Director of Engineering over there; I was leading a team of 30 people. And we had thousands of different data integrations. And data integration is basically how you take data from one place, and it’s also about how you get data into another place. And we’re moving

Eric Dodds 03:43
on integrations back then is huge.

Michel Tricot 03:46
It is, it is. I think we got burned quite a few times; it’s a very hard problem. But in the end, what you need to do is really think about how you build, how you maintain, and how you scale. And it’s just that these pipes, you keep having more of them, and they keep becoming bigger and bigger. I think when I left in 2017, we were moving hundreds and hundreds of terabytes of data every single day. So I had to learn how to build this system the hard way.

Eric Dodds 04:14
Wow. Yeah.

Michel Tricot 04:17
And yeah, after that, later on, joining those other startups, I started to do the same thing, which is you get data from point A to point B, and I was like, okay, what if I go for the crazy idea of solving it for more than just one company at a time? And that’s how John and I started Airbyte. So helping people move data from point A to point B without having to spin up massive data teams to do it.

Eric Dodds 04:45
Yeah. So I’d love to hear just a little bit about... Well, I was watching a movie with my son the other night, and it’s an older movie from the 90s, I think. And they’re in this lab, and a piece of equipment breaks, and they said, we lost all 30 gigabytes of data from this experiment. It’s just so funny because you’re like, man, that was catastrophic back then. But so you went from a couple hundred gigabytes to hundreds of terabytes a day. Can you just explain? I mean, I’m sure some of our listeners have gone through that, but probably a lot of them haven’t. What are the key things that you sort of took away from that experience of sort of this exponential growth in magnitude of just trying to move data?

Michel Tricot 05:35
Yeah, the key thing about moving data is, you need to think about it almost as a factory. Which is, it is not just a technical challenge, it is a process challenge. And it is not something that you only solve with code or, you know, some software; it’s also something that you need to solve with people. Because when you think about the amount of places where you have data, it’s impossible to write software that can get everywhere. I mean, you’re going to spend years and years and years doing it. So for a single company, it’s very hard. So it’s about how you set up the right process, so that you enable people to actually be able to pull data from there, and you start dispatching the responsibility to more and more people, and you can actually almost crowdsource the maintenance, crowdsource the building, and crowdsource the scaling of these connectors. And once you start that, the other thing is, you need to think about it in the sense that you are not building a system that works 100% of the time; you’re building a system that has to be resilient. The thing is, these are connectors, these are integrations; it always breaks one day or another. And you need to build your system with this in mind, because, I mean, in the end, you depend on a ton of external places that you have no control over. I mean, I don’t know, tomorrow Facebook can decide to change how the API behaves, or it can be Stripe that does the same, and you don’t have control over their product decisions. And you need to make sure that the system you build is resilient to it, and that you have the process in place to solve it.
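
To make the resilience point concrete, here is a minimal sketch in Python of wrapping a connector pull in retries with exponential backoff. It is only an illustration under stated assumptions: `ConnectorError`, `sync_with_retries`, and the `pull` callable are hypothetical names invented for this example and are not part of Airbyte or any other product mentioned in the episode.

```python
import time
from typing import Callable, Iterable, List


class ConnectorError(Exception):
    """Hypothetical error raised when an upstream API misbehaves or changes."""


def sync_with_retries(
    pull: Callable[[], Iterable[dict]],
    max_attempts: int = 3,
    base_delay: float = 1.0,
) -> List[dict]:
    """Call a (hypothetical) connector pull function, retrying on failure.

    The connector is assumed to fail sometimes, so every failure is caught
    and retried after an exponentially growing delay. If all attempts fail,
    the last error is surfaced so that a person can step in.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return list(pull())
        except ConnectorError as err:
            last_error = err
            # Delay doubles on each attempt: base, 2 * base, 4 * base, ...
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise ConnectorError(f"giving up after {max_attempts} attempts") from last_error
```

Retries only cover transient failures. A breaking API change still needs the process and people side that Michel describes, which is why the sketch re-raises the error instead of swallowing it.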

Eric Dodds 07:21
Yeah, interesting. I remember there were a bunch of companies that were sort of built around an API that Facebook had made available, that sort of made it really easy to gather large amounts of information on individuals. And they changed it overnight, and like 20 startups just evaporated. I mean, that’s kind of an extreme example. But even minor changes, especially if you think about enterprise scale, can make a really big difference on a daily basis. Take an e-commerce company that relies on data to understand conversion, or send repeat purchase emails, or other things like that. I mean, if something breaks, it literally costs huge amounts of revenue. One thing we talked about as we were prepping for the show, which I would love for you to speak to: you were at LiveRamp a while ago, and you were dealing with data at a really large scale. And I think in many ways, a scale that your average data engineer or data analyst doesn’t get to experience, just because it was such a huge amount of data. But that was still a while ago. And so have you seen the lessons that you learned there translate? I mean, the technology stack is very different today than it was back then, in some ways. But there’s sort of a trickle-down effect from companies that are sending audience data, hundreds of terabytes a day. What’s the trickle-down effect? And how long does it actually take for the problems that you solved to hit the average data engineer, or show up as tooling for the average data engineer?

Michel Tricot 09:01
Yeah, that’s a very good question. Well, the thing that I love looking at right now is, when I look at the data landscape, and how things are moving, and the new types of products, you look at who is building these tools, who is building these products, and you realize that all of them, or most of them, have had this problem way before. And it’s just that with scale, you encounter new challenges that you have to solve the hard way, because there is no solution that exists on the market. And because it’s data, it just grows exponentially. So everything that we learned 10 years ago was something that was specific to LiveRamp, or it can be specific to Google, or specific to Facebook, specific to Netflix, or any of these big companies that were built and that really became massive at that time. And engineers there had to learn what kind of technical assets, what kind of technical skills they had to build. And once they are out of these companies, they realize that data is actually scaling exponentially. So all these other companies that don’t have the same volume of data actually start to face the same type of problems. And with this technical knowledge of how you actually solve this problem, now you get this new generation of products that allow the more common consumer, or the more mainstream consumer, to actually be able to be very, very good with data. So it was more like we were the early adopters. And now it’s the land of the mainstream.

Eric Dodds 10:37
So, and I’d love for you to talk about sort of... okay, so five years ago, or, you know, however long ago, you’re solving huge problems at LiveRamp, or an engineer is solving them at Facebook or Netflix. And so they sort of learn the fundamental components of the problem from an engineering standpoint, and then at the same time, technology is advancing, right? And so then when they leave that company and sort of encounter the new technology that’s there, is that sort of the point where they say, okay, I can now build something that solves this in a way that sort of meets the needs of the mass market in a way that wasn’t possible before?

Michel Tricot 11:17
Yeah, that’s correct. I mean, just think about warehouses, for example. I mean, 10 years ago, you had some warehouses, but most of the analytics was built on Hadoop at scale. And it’s just that people using Hadoop started to realize, okay, that’s not the best system. So on top of that, you start putting Hive, you start putting more and more layers. And one day you have BigQuery, the next day you have Snowflake, people who have taken analytics to the next level. And now all these engineers who have been working with this technology say, okay, I have this amazing processing engine. What was I doing with all this complex system that is now becoming much simpler, or that can enable more use cases by using a data warehouse? And yes, technology is just growing, and it makes creating these products easier and more approachable for, maybe, less data-heavy companies. And, for example, we always talk about the modern data stack: you do extract, you do load, you have your warehouse, you have your analytics, you try to orchestrate all of that with Airflow, with transformation with dbt, etc. But if I look at it, in 2013, 2014, with Redshift, we already had exactly the same system internally. It’s just that this type of system becomes more mainstream, and there is more tooling, so that you don’t need a huge team to build all the tooling around it. I mean, at the time, I don’t know if Airflow existed, but we had our own workflow manager, we had our own transformation manager. It’s just that people coming out of these companies are actually building that so that it can be used by more and more people.

Kostas Pardalis 13:05
Michel, you mentioned the... Hi, by the way, Michel. After, maybe you can relate to that from your accent, but after two years here I’m still struggling with the difference between kilometers and miles. So sorry for my delay, but I miscalculated a few things. So you mentioned the term “the modern data stack.” What’s your definition of the modern data stack? What are, let’s say, the main components of it? And you mentioned also that it’s not like we are doing something new that we didn’t do in the past, right? So why is it modern?

Michel Tricot 13:48
I think it’s more... it’s something that is enabled by technology, which is the composability of your system. What we’ve seen back in the day with Hadoop, with Spark, is that you have a very monolithic way of working with data. And with more and more tools being added, I mean, if you look at all the Apache projects, basically most of them are about data; it’s just all these little tools that are coming on top of it. And the modern data stack is more about how you go from an end-to-end solution to something where you use the best of breed for every single piece in your data value chain, because data is so tied to your business that generally using a do-everything solution doesn’t work. You get to 70% of what you need, and then you go a little bit outside of how it was thought about, and then you need to build your own parallel data system. And for me, the modern data stack is more about the composability of a system, and the fact that as your business changes and evolves, you can start adding more and more building blocks, and you have the choice of picking which vendor, which solution you want to use, and it’s a matter of assembly. And that’s why products like Airflow, like Prefect, and others are becoming so powerful, because that’s where you encode the logic of how you glue all these different tools together.

Kostas Pardalis 15:21
Yeah, that makes total sense. And what are the main components of the data stack? I mean, you mentioned Airflow, for example, which is the orchestration part and somehow glues everything together. But what else is needed there to have, let’s say, the minimum viable data stack?

Michel Tricot 15:38
Yeah, I would say ingestion, processing, transformation, visualization.

Kostas Pardalis 15:45
Mm hmm. Okay. Make sense? And then, you

Michel Tricot 15:49
know, with maybe a little bit on the reverse ETL side, which is, how do you actually activate the data back and put it back into a place where it can be activated?

Kostas Pardalis 15:59
Yeah. What about ML? Where do you think that fits? Is it something like, okay, let’s have the data stack in place first, let’s solve the basics, and then we move into, say, the more advanced things like MLOps, and all these things that you see coming up right now, with products like Tecton, with feature stores and all that stuff?

Michel Tricot 16:20
Yeah. So I put that in the activation part. So whether it’s about making the data available, or whether it’s about an operational use case. I mean, yes, it’s an operational use case. And I think also, that’s why it’s not just about analytics. Yeah. And that’s where all the orchestrators are important. Because as you said, you have ML, you have quality, you have a lot of things that you might want to do. It’s just that depending on your business, you might or might not need that particular function. And it’s just about where that fits in your pipeline. But yes, that’s part of it; it’s just the composability of all your data value chain, from the beginning to the end product.

Kostas Pardalis 17:04
And where does Airbyte fit now, today? So today,

Michel Tricot 17:09
we fit on the ingestion and loading piece, which is just breaking down silos and making sure that you don’t have to think about the physical and technical complexity of pulling data from one place and feeding it into another place.

Kostas Pardalis 17:29
And what is Airbyte going to be in the future?

Michel Tricot 17:34
So the first thing is, the goal today is really about commoditizing data integration. But when you think about data integration, there is a purpose behind it, which is moving data around: you have data at point A, you want to get it to point B. And that is the larger vision behind Airbyte: how can we make sure that there are pipes that allow the data to flow and to get to the place where it’s going to be the most valuable for your organization? And it’s not about extracting insight, it’s not about visualization, it’s not about transformation; it’s just, let’s focus on having a perfect movement of data. And that could come also with adding quality on top of it, adding a lot of additional features to make sure that you don’t just have pipes, but you have smart pipes.

Kostas Pardalis 18:25
Okay, yeah, that’s very interesting. And do you think that... I mean, you mentioned composability, right? So we have all these different parts of the data stack, and we try to make them all work together. And that’s why data engineers are in so much demand right now. So you mentioned quality, right? Now you have all these different products out there, like Monte Carlo, Bigeye, all these new players entering the market, that somehow, in order to deliver value, need to work very closely with another part of the data stack. It might be the data warehouse, or it might be the data pipelines, right? Do you see in the future quality being part of some more fundamental part of the data stack? Or do you see a different category remaining there? What do you see happening there?

Michel Tricot 19:24
I think it’s a matter of who is using it and who will get value from it. I think quality can exist at multiple layers. You can have physical quality: is there a missing field? Or is there a lower volume of data? And that could be done at the pipe level. But then you might have business quality, which is: is the sum of my revenue less than $200,000? And that’s where composability is so important: companies will learn what is important to them. And quality is just something that you put in different places. It’s the same thing when you’re thinking about factories: you have quality checks in multiple places, because that allows you also to know where you have a problem. So, yeah, quality is just omnipresent in the data stack, and it’s just a matter of who gets value from it. So companies like Bigeye, or others, they need to be there, because you have so many people that are interacting with the data warehouse that know what is good data and what is bad data, and that need to have the tool. I think maybe one thing we didn’t talk about when we talked about the modern data stack is that it’s about making data a platform, instead of something that is fully controlled by data engineers. And once you start exposing data as a platform to the rest of your organization, then you need to have more than one tool for doing quality at different steps of the pipeline.
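
The two layers of quality described here can be sketched roughly as follows. This is an illustrative Python example only: the function names, the `revenue` field, and the thresholds are assumptions made up for the sketch (the $200,000 figure echoes the example in the conversation), not the API of Airbyte, Bigeye, or any other tool.

```python
from typing import Dict, List, Sequence


def physical_checks(
    rows: Sequence[Dict], required_fields: Sequence[str], min_rows: int
) -> List[str]:
    """Pipe-level checks: missing fields and unexpectedly low volume."""
    problems = []
    if len(rows) < min_rows:
        problems.append(f"low volume: {len(rows)} rows, expected at least {min_rows}")
    for index, row in enumerate(rows):
        missing = [field for field in required_fields if row.get(field) is None]
        if missing:
            problems.append(f"row {index} missing fields: {missing}")
    return problems


def business_checks(rows: Sequence[Dict], min_total_revenue: float) -> List[str]:
    """Business-level check: is total revenue below an expected floor?"""
    # Missing revenue counts as 0 here; the physical layer reports the gap.
    total = sum(row.get("revenue") or 0 for row in rows)
    if total < min_total_revenue:
        return [f"total revenue {total} is below {min_total_revenue}"]
    return []


if __name__ == "__main__":
    sample = [{"id": 1, "revenue": 150_000}, {"id": 2, "revenue": None}]
    print(physical_checks(sample, ["id", "revenue"], min_rows=3))
    print(business_checks(sample, min_total_revenue=200_000))
```

Keeping the two functions separate mirrors the factory analogy: a failure in `physical_checks` points at the pipe, while a failure in `business_checks` points at something only the business team can interpret.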

Kostas Pardalis 21:03
Yeah, 100%. I think that’s very, I would say, obvious when you enter data quality in the ML space, where the tools that you need to use there to figure out if you have to retrain your model, for example, or do stuff about the model itself, and try to figure out if something goes wrong there, it’s a completely different kind of beast that you have to work with. So yeah, I agree. I think we’re just at the beginning of figuring it out, to be honest. It’s a huge issue. And I’m very curious to see what else will come out there in the market.

Eric Dodds 21:40
I have a question for each of you, though, and actually, Kostas might be tired of hearing this. But I think it’s a really interesting question for listeners, because they span sort of enterprise to startup. But Michel, when we talk about composability, when we think about quality, size of company, and sort of complexity as a proxy for size, has a really significant influence on the pain you feel from sort of lack of data quality, right? So an example is, I heard someone say, okay, what is your analytics when you’re a two-person startup in a garage? You just directly query your Postgres app database, right? And you learn everything you want to know, right? But then when you’re a 1,000-person company, that’s a completely different game. And like you said, you need to pull data from many sources, you need to do transformations on it, there’s a quality component, there’s visualization, and then sort of the activation side of it. What are the triggers that you’ve seen that are sort of indicators that people need to address those issues? And I’ll also caveat that by saying, in an ideal world, I think smart companies try to solve these problems sooner with good infrastructure, good orchestration, good data quality practices, but I think anyone who’s been inside a company knows it’s really hard to do that while you’re growing a business. So how does size influence all of these factors that we’re talking about? it’s

Michel Tricot 23:15
so first of all, it’s just a matter of how much context there is and who interacts with your system. And that’s why composability is so important. Before, it was a single person that was working with the data. As your organization grows, it’s not just your data that grows; your team is growing. The number of people that are interested in data is growing. You might have marketing that wants to know something about their data, you might have sales, you might have finance, you might have product, and they all want something with data. And that generates complexity, because they don’t have the context about all the data that is flowing through these pipes. And that’s why, when we think about the modern data stack as becoming a platform for other roles, that is the complexity that needs to be fixed. And that’s why composability is so important. Because you don’t know who, tomorrow, is going to need data to make your organization better and go faster. And so at that point, you want to make sure that you bring in a system that is not just frozen in time, but one that can actually evolve with your company and with your teams. And of course, that comes with complexity. But in general, complexity can be, not addressed, but made simple or simpler with more composability and more choice.

Eric Dodds 24:41
Yeah, that’s fascinating. We had a guest recently who made an observation I’ve been thinking about a lot, where they said the move to the cloud was supposed to simplify a lot of things, and it did simplify a lot of things, right? Deployment, sort of managing on-prem stuff, right? But he said it’s made the tech stack way more complex, because everything’s easier. And I think it’s such a good observation that complexity is not driven primarily by technology, or only by technology, but by demand for data inside the organization. And the lack of context is a huge challenge there. That’s such an interesting observation.

Kostas Pardalis 25:21
A hundred percent. I think, first of all, what you said, Eric, about what your analytics look like when you’re in the garage and you just have a Postgres: I would argue that that’s not real anymore, if you think about it, and because of the cloud, right? Even when you just start, you will probably have some data on Google Analytics, you will run some experiments with some ads, you will probably have a basic CRM, or at least some Google Sheets where you keep track of certain things, right? So my feeling, at least, is that more and more smaller companies will need something like Airbyte really soon, to use it and get the value out of the data that they have. I think why size is important, and why it matters, is because of organizational complexity. That’s where things get really messy, because suddenly, as Michel mentioned, you don’t know who else inside the company is going to need the data. But at the same time, it’s much harder to communicate any issues with the data, or fix or identify them. When there are just two founders, and there is a problem in a spreadsheet, they just talk to each other and they fix it, right? Now think about a company where you have to collect the data that might be edited by salespeople in Salesforce. And then when you take this data, there are some analysts that go clean it and create some dashboards, and then the data scientists will take these dashboards and, based on that, will create a subset of the data to go and build a model that’s going to magically, I don’t know, come up with some numbers. And then the data engineer will go take this data and push it back to Salesforce, for example, so the salespeople, again, can do something. Just think about all the different departments that are talking already, and how difficult it is to communicate with all of these, even for much simpler problems than data that has to be moved from one place to the other. So I think this organizational complexity is super important. I think that’s one of the reasons that you have some influencers, let’s say, in the space, like the guy from Locally Optimistic, where he posted the post where he said that the problem with data is an organizational problem, and it’s not a technical problem. And you have also this model that Michel is talking about, the different parts of, let’s say, the supply chain of data, where you need quality, for example, in different parts, and someone else cares about each one of these, right? Anyway, I think we’re still at the beginning, and we’re still scratching the surface of the complexity of building, in the end, a data-driven organization. It’s going to be very, very different working with these systems compared to building mobile apps, for example; the complexity is very, very different.

Michel Tricot 28:18
And the hunger for data: just by opening it and making it more accessible, you discover new things that you can do with it, and it’s going to continue to grow, and people are going to become greedier and greedier for having more data and making better decisions. So in the end, the people who have worked already with that much data have an edge on the type of product that can be built to enable this new generation of data consumers. Yeah, Michel,

Kostas Pardalis 28:49
I want to go back to Airbyte a little bit, because we can be talking a lot about data in general, right? We can have multiple episodes, the three of us, talking about that stuff, for sure. But I would like to share a little bit more about the product and the company with our audience. So Airbyte is an open source product. Why open source, why is it important, and how has it helped the company grow?

Michel Tricot 29:14
Yeah, one thing that I was mentioning before was really that solving data movement and solving data integration is not just a code or a technology problem. It is a process and people problem. Because when you look at all the existing solutions, they generally plateau at a number of connectors that they can support. And the reason is simple: it’s very hard for one single entity to manage that many connectors, because that’s the problem with data connectivity. That’s why it’s a very hard problem to solve. You have so many places that you can connect to, and you don’t know what all these use cases are. And at that point, when we thought about open source, that was basically because of that: it’s something that needs to be built, and that needs to be almost crowdsourced. You want to make sure that you have more than one company that has control over how you actually move data around, and what kind of connectors matter, because building is relatively easy, but maintenance is where the cost is. So at that point, what you want is to give the power to the people who are using the platform to actually solve the problems when it breaks, right? Because if they’re using a closed-source solution, if they have a problem, they will have to wait four weeks. So in that case, what they will do is they will start building it internally. But then when you build it internally, this becomes a gigantic monster that grows and grows and grows, and it’s out of control. And here you have access to something that sometimes you need to fix, and the rest of the community has access to it, or someone else from the community fixes it, and you get access to the fix. And by creating this very virtuous cycle, then you get more connectors, and you get more people that contribute and that actually have a seamless experience with data integration. So open source was really about solving the people aspect of it. Yes, open source is also technology. But it was really about, let’s build a community, and let’s make sure that we make data available across the community and users.

Kostas Pardalis 31:27
Right. So okay, that makes total sense. I mean, apart from the connectors themselves... So before Airbyte, there was Singer; I mean, it’s still out there. And I know that Airbyte has a protocol, which is actually an extension over Singer. What is Airbyte doing that is different from what Singer did, and from what Stitch, the company behind Singer, did, before they got acquired, at least?

Michel Tricot 31:54
Yeah. So interestingly, when we started Airbyte, we started to build on top of Singer, and we discovered some of the flaws. The thing is, the team has a lot of experience in building data integrations, so we saw flaws in how it was designed. We have compatibility with Singer to make sure that people who have invested their team’s time into Singer can also leverage those connectors within Airbyte. But in the end, the protocol, or I would rather call them the guidelines, of Singer are way too permissive. And that breaks the contract of solving data integration, because you end up with almost pairwise compatibility. When you have this absence of rules and this absence of guidelines, you’re basically building one-to-one pipelines, and that’s all. You get an N times M problem instead of an N plus M problem. So that was what we saw with Singer. Also, the community of Singer was, I mean, Stitch got acquired by Talend, and I think Talend dropped the ball on Singer. A community like that is something you really need to invest in. And that’s something we’ve done very, very early on. One of the first hires we made at Airbyte was someone who is there with the community, helping them be successful with open source. And because we started a year ago, the first versions were obviously pretty unstable, so having someone to help every single person in the community was very important. We’ve continued to grow that function and make sure we have a seamless experience on open source. If you don’t support your community, you cannot build that network of people who help each other and build and maintain connectors together.
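
Note: the N times M versus N plus M point above can be made concrete with a little arithmetic. The sketch below is only an illustration; the function names and the example counts are assumptions and do not come from the conversation or from Airbyte.

```python
def pairwise_pipelines(num_sources: int, num_destinations: int) -> int:
    """One bespoke pipeline for every (source, destination) pair."""
    return num_sources * num_destinations


def protocol_connectors(num_sources: int, num_destinations: int) -> int:
    """One connector per system, all speaking a shared protocol."""
    return num_sources + num_destinations


if __name__ == "__main__":
    # Illustrative counts only; they are not figures from the episode.
    for n, m in [(5, 3), (50, 10), (200, 20)]:
        print(
            f"{n} sources, {m} destinations: "
            f"{pairwise_pipelines(n, m)} pipelines vs "
            f"{protocol_connectors(n, m)} connectors"
        )
```

With a shared contract, adding one new destination means writing one connector rather than one pipeline per existing source.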

Kostas Pardalis 33:45
So you said that Airbyte is actually fixing some of the issues that Singer had. So what are these new elements that Airbyte’s, if not exactly a protocol, let’s say guidelines, or whatever we want to call it, brings that Singer didn’t have?

Michel Tricot 34:01
Yeah, so we actually call it the Airbyte protocol, for the reason that we’ve encoded a lot of behavior and logic, and there is almost a specification of how you build a connector and what the messages should look like. But there are a few things. The first one is that you don’t have problems with the environment. That was a big problem with Singer: you want to use a tap, but you don’t have the right Python, you don’t have the right C library or the right bindings, so 80% of the time you need to do a lot more work to get it running. That’s the first thing. The second thing is that it has to be programmatically configurable, which means that a connector should expose what kind of input it requires and what the state looks like, so that you can be smart at the platform level and start building on top of connectors instead of hard-coding behavior. That’s something we actually learned while using Singer: if you want to use a tap from Singer, you have to read through the code. You don’t have a way to automatically know, oh, I need an API key and a start date, I need something else. So we made it part of the interface of the protocol. The other thing was about being language agnostic. That was very important, because if you look at data integration, not everything is an API. You have queues, you have databases, you have Kafka, you have a lot of things, and very often the right choice is tied to the programming language that is going to be consuming data from, or pushing data to, that system. I would hate to have to push data at scale to Kafka with Python. If I want to do it, I want to do it in Java. So having the flexibility and being language agnostic was a very important requirement for us. I’m just summarizing, but those are the criteria that we had and how we thought about data connectors.
And it’s also that, if you want to grow your community, not everybody knows how to write Python. Sometimes they want to write in C#, and if they want to, they should be able to contribute in C#. One of the first contributions we had was in Elixir. I’ve never played with Elixir, ever. But sure.
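
Note: the idea of a connector that exposes the input it requires can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual Airbyte specification format: the spec layout (loosely modeled on JSON Schema), the field names `api_key`, `start_date`, and `page_size`, and the `validate_config` helper are all hypothetical.

```python
# Hypothetical description a connector might return when asked what it needs.
EXAMPLE_SPEC = {
    "required": ["api_key", "start_date"],
    "properties": {
        "api_key": {"type": "string"},
        "start_date": {"type": "string"},
        "page_size": {"type": "integer"},
    },
}

# Maps the declared type names used above to Python types.
_PY_TYPES = {"string": str, "integer": int}


def validate_config(spec: dict, config: dict) -> list:
    """Return a list of problems; an empty list means the config is usable."""
    problems = []
    for name in spec.get("required", []):
        if name not in config:
            problems.append(f"missing required field: {name}")
    for name, value in config.items():
        declared = spec["properties"].get(name)
        if declared is None:
            problems.append(f"unknown field: {name}")
        elif not isinstance(value, _PY_TYPES[declared["type"]]):
            problems.append(f"wrong type for {name}: expected {declared['type']}")
    return problems


if __name__ == "__main__":
    # The platform can check a config without reading the connector's code.
    print(validate_config(EXAMPLE_SPEC, {"api_key": "secret"}))
```

Because the requirements are data rather than code, a platform could also use the same description to render a configuration form or reject a bad config before any sync starts.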

Eric Dodds 36:24
Wow. I mean, I know Elixir is growing in popularity, but that’s kind of obscure.

Michel Tricot 36:30
Yeah. But in the end, it worked. It was really a proof of concept of how Airbyte can work with more than just one language, and can be used by people who have the talent and who are using the tools best fitted to solve that particular problem.
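
Note: one common way to make a connector contract language agnostic is to treat each connector as a separate process that prints one JSON message per line, so the platform never cares what language produced the output. The sketch below assumes that approach for illustration; the message shape (`type`, `stream`, `data`) and the helper names are hypothetical and are not taken from the Airbyte specification.

```python
import json
import subprocess
import sys


def parse_messages(lines):
    """Yield (stream, data) for every RECORD message and ignore everything else."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            message = json.loads(line)
        except json.JSONDecodeError:
            continue  # plain log output; a real platform would likely log it
        if isinstance(message, dict) and message.get("type") == "RECORD":
            yield message["stream"], message["data"]


def run_connector(command):
    """Run any executable as a connector and collect the records it prints."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return list(parse_messages(result.stdout.splitlines()))


# Stand-in connector. It happens to be Python here, but any program that
# prints the same lines (Java, C#, Elixir, ...) would be handled identically.
FAKE_CONNECTOR = [
    sys.executable,
    "-c",
    'import json; '
    'print(json.dumps({"type": "RECORD", "stream": "users", "data": {"id": 1}})); '
    'print("log: done")',
]

if __name__ == "__main__":
    print(run_connector(FAKE_CONNECTOR))
```

The only thing the platform depends on is the line-oriented output, which is why a contribution in an unfamiliar language can still plug in.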

Kostas Pardalis 36:47
Yeah, I have a question about that. I understand and believe that there’s always some kind of trade-off between flexibility and quality, right? For example, take Kafka Connect, which is another framework that you can use to create connectors, specifically for Kafka of course, but the whole idea of the community around it and all these things is similar, right? At some point, you as Airbyte will have to, or will want to, ensure the quality, right? How can you do that when there is so much freedom in terms of how someone can build something, or what framework they can use? Let’s say someone comes with Elixir, or someone comes and writes something in Haskell, right? What are you going to do as Airbyte with that?

Michel Tricot 37:36
Yeah, that’s a very, very good question. So one thing that we’re working on right now is our contribution model. We’re basically creating a new contribution model, and it is going to be powered by Airbyte Cloud. There will be a set of connectors that is fully maintained by Airbyte, just because some of them are so core. Typically for database connectors, we need to make sure we have very strong quality, or not just quality, but a very strong say on the roadmap. But for the other ones that are not part of this subset of certified Airbyte connectors, what we’re going to be doing is making them available on Airbyte Cloud and providing a rev share to the rest of our community, the people who are actually maintaining these connectors, whenever the connector is used. Community members can be individual contributors, or they can be data agencies, or they can even be vendors. So if you’re a vendor and you want to create a new revenue stream via Airbyte, that’s something you can do. And today, with Hacktoberfest, we got a massive, massive number of connectors contributed to Airbyte. People are really seeing the value of having their connector run on Airbyte. So there is this will and desire to be part of the program, to get a rev share as the connector becomes more successful. And at that point, you also have a nice balance: if someone stops maintaining a connector, or the quality is not there, either this connector gets transferred to someone else, or someone is going to create a better one. So there is a bit of a race, to some extent, toward making sure that the product is high quality.

Kostas Pardalis 39:26
Yeah. It’s very interesting, and I’d love to chat about that again in the future. One last question from me: how important is dbt for the Airbyte protocol, and for Airbyte as a platform? I’m separating the two, so I’m curious about that.

Michel Tricot 39:44
Not at all for the protocol. We use it more as a post-processing piece on warehouses, but in the end, it is just making the data a bit more consumable when it’s being loaded. It’s not required for the protocol, because the protocol is just about configuration, data exchange, and connection. That’s all. For the platform, it’s more about who you’re talking to and what you’re working with. Most people we work with are data analysts, data scientists, data engineers, and analytics engineers. If they don’t have Airflow running, or some orchestrator on top of it, they want to have a very simple way to kick off dbt jobs, whether it’s by using open source or, right now, we’re also working on how we can make it work with dbt Cloud. But it’s more of a handoff to the rest of the data stack. As I mentioned, we want to be the best at just extract, load, and data movement. That’s all. We don’t want to do transformation. What we want, though, is to have a mechanism to hand off to the downstream system.

Kostas Pardalis 40:50
Okay, that’s super interesting. I could keep talking about this for a long time, but I think we’re getting close to our time here, right, Eric?

Eric Dodds 40:58
We have a few more minutes, so if you have another question, go for it. Otherwise, I have one. So, Michel, one question, and I’m interested in your perspective on this. Of course, our listeners know I have a background in marketing, and you were at LiveRamp. LiveRamp has been a major player in the marketing data space for a long time. And anyone who works in data inside a company knows that marketing tends to be the most hungry, or one of the most hungry, departments and generates a lot of demand for data. You mentioned that complexity: you give people data, and then there’s more demand for data, because it creates more questions and more value. Marketing is a major consumer there. When you think about marketing, a lot of times it’s audiences, advertising, and conversion data. A lot of it is happening on the client side, in the actual experience, and then feeding conversions back into the system to optimize advertising algorithms. I know it’s more complex than that, but that’s a huge need in marketing. What are the major use cases, or the biggest areas of demand, that you see for companies that are using Airbyte? Are there particular types of data? Does it fall to one department? Who are the most greedy data consumers, and what are the use cases around that when it comes to Airbyte?

Michel Tricot 42:24
Yeah, so I will give two. Definitely marketing is a big one, but it’s rarely marketing by itself. It’s generally part of a bigger initiative. Marketing doesn’t care so much about replicating product databases, but at some point, they realize that they need this information. And marketing is really a consumer. In the end, it’s about empowering them to just move the data, so they don’t even have to talk to a data team to get it to work. We work mostly with the data teams to build a platform that marketing can use to self-serve. So the use cases are going to be, as you’ve mentioned, attribution and 360 views of the customer across all the touchpoints, whether it’s on the product, on the finance side, or on Stripe payments. How do you get this whole 360 view of your customers? Now, the other use case that we see a lot is companies that are building a product and that need to have connectivity in the product. If you look at an e-commerce analytics company, they are good at measuring analytics, and they are good at providing value to their customers. But to do that, they need to actually pull data from Shopify, from Google, from Bing, from Facebook, et cetera. They want to focus on the value prop; they don’t want to focus on the connectivity part of it. So at that point, we’re more in an operational use case, where we become the layer for them to acquire that data on behalf of their customers. That’s been a pretty big use case for us as well. But otherwise, marketing analytics is huge. We also have product use cases, where a larger org, like a larger engineering team or product team, wants to understand, for example, analytics on Git commits. They want to have analytics on PRs, they want to have analytics on workflows, and they build their own internal tools or analytics to actually measure the efficiency of their teams.
I think creating a protocol allows you to stay away from very, very specific use cases that could narrow the scope of your product. If we only focus on the data movement piece, then we can enable use cases that we don’t even have an idea about. Some people were using Airbyte to prime a cache on Redis. Every hour, they would just drop everything on Redis and reload a database into Redis. That’s something you cannot predict, but it’s possible because the platform is flexible and focuses on movement instead of silos.

Eric Dodds 45:11
Yeah, it is super interesting. I think you talked about machine learning as an activation use case, and I think that’s a really helpful way to think about it. Because in many ways, if you think about really well-done marketing analytics, that’s actually what you need to feed a machine learning model that’s going to drive the business, right? And so it really is almost like you get the marketing analytics layer correct, and then that opens the door to machine learning, which is super interesting.

Michel Tricot 45:45
And that’s where complexity comes from, then.

Eric Dodds 45:48
Yes,

Michel Tricot 45:49
You answer one question, and now you have 10 more. Exactly. And so you need more teams, and you need more specialization in how you extract insight. So yes.

Eric Dodds 46:00
Yeah, for sure. Okay, one more question for you. We’ve talked a little bit about this on the show over multiple episodes. As Kostas knows, I’ve talked about a world where all sorts of data movement and processing align with a particular business model, and in a few clicks, you can basically set up an entire stack, right? We’re not necessarily completely there, but we’re getting closer. And you’ve talked about commoditization of data products a lot. So I kind of want to assume that a lot of these things are commoditized. Number one, what’s your definition of commoditization? But number two, I’m really interested to know, what does commoditization unlock for us, especially for people working in the data industry? Because I think there’s still a lot of inefficiency, just because companies are figuring out how to build technology, and things are getting cheaper, but at different rates. And so there’s still a lot of complexity, or froth, as people would say, in the market. But let’s assume everything gets commoditized. What does that unlock? But first, what does commoditization mean?

Michel Tricot 47:13
Yeah, so first of all, I just want to clarify one thing: I don’t think every data product can become a commodity. What I’m saying is that data integration should be commoditized. The ability to pull your own data and your fragmented data assets should become a commodity. It shouldn’t be something you have to think about. It’s your data; you need to be able to move it where it is going to have the most value for you. So when we think about commodity, that’s how we think about it: let’s make sure that you can very quickly break down the silos. That’s what we mean by commodity. It’s also about simplification, how simple it is to use, almost to the point where you shouldn’t have to think about moving data as a problem for an engineer. Now, I won’t call a machine learning algorithm a commodity, even though the buzz is there. Processing is becoming more of a commodity, but then the question for these companies becomes: what kind of additional value do you build on top of the commodity? Typically for data movement, it’s about quality and it’s about observability. There are a lot of things you can build on top that make something that is commoditized even more valuable.

Eric Dodds 48:39
It’s like the infrastructure piece, right? If we think about data movement, and then even, I mean, I don’t know if you’d call Snowflake a commodity. It’s not really the language people use. But if you think about warehousing in general, you can set up a really robust pipeline structure and warehouse very easily, right? Those things are becoming commoditized, which is great. It’s opening so many doors.

Michel Tricot 49:03
Yeah. And commodity means it is something that people expect to always work. That’s where we want to be with data integration. Then it becomes: what intelligence did you build on top of this infrastructure? What kind of additional value, relying on this fundamental, can you actually start building, and what kind of use cases does it enable? We like to think about this as the Maslow pyramid. If your fundamentals are not there, you cannot move up. Basically, your fundamental is your commodity, and you want to make sure that your fundamentals are addressed so that you can start thinking at a higher and higher level, with things that bring even more value to your business.

Eric Dodds 49:48
Yeah, I love that, thinking about the data stack as sort of a Maslow’s hierarchy. That’s great. One last question for you. I’m thinking about people in our audience who are excited to learn from your experience. Having such a long history working with data in a variety of contexts, and now building a data company, do you have any advice for data engineers out there who are thinking about the future of their career and working in data?

Michel Tricot 50:14
Yeah, I would say think about creating leverage. Creating leverage means enabling other people to be good with data. I would say that’s the secret for a data engineer. You don’t want to be building data connectors, for example, because, as we discussed, that is going to take you a ton of time. What you want to ask is, what can you do to actually enable other people to be extremely, extremely good with data? What kind of tooling, what kind of new technology do you need to build to make the people who consume data even better with data? Because that’s how you get your leverage with the rest of your team. And that’s how, as a data engineer, you can really grow quickly: think about use cases, and think about enabling other people.

Eric Dodds 51:03
Incredible advice. Well, Michel, I’m sad that we’re out of time. We’re going to have to have you back on the show, because there are so many questions that I think both Kostas and I didn’t get to ask. But thank you for joining us. This has been a great conversation, and we’ll have you back on the show again soon.

Michel Tricot 51:16
Thank you so much, Eric. Thank you so much, Kostas.

Eric Dodds 51:19
I think this was one of the most interesting takeaways from the show. We’ve talked about the increasing complexity of the data stack, but no one has framed it in the context of demand from various parts of the organization. And I almost feel a little bit stupid for not having thought of that as a way to frame it. But I thought that was a very elegant explanation of a main driver of the complexity, because it’s so easy for us to think about the tools. Oh, there’s a new tool for this, oh, there’s a new tool for that, right? CDC, streaming, all this sort of stuff. But really, demand from different parts of the organization and their different needs are the main driver. That’s a great reminder for me, and I hope for everyone who’s listening.

Kostas Pardalis 52:05
Yeah, well, I think, as always, Michel is very good at articulating pretty complex concepts, which makes sense. I think it’s one of the skills that people who are successful in building tech companies have, right? So I think it’s an indication of the success of Airbyte that he’s able to do that. What I will keep from our conversation with him is the concept of composability. I think that was a very interesting way of thinking about what the data stack is. That’s one thing. The other thing that I found very, very interesting is that machine learning, in the end, is also activation, which again is something I hadn’t thought about, so I keep thinking of it in a very different way now. It’s actually interesting, because if you look at the market, the companies that are building products around serving models and the companies that are doing reverse ETL are very, very different products and components, although the end result is the same, in the sense that the need of the company is the same. So for sure, that’s something very interesting to observe in the market, to see how it’s going to evolve as the market grows.

Eric Dodds 53:21
All right. Well, thank you for joining us on this episode. We have many more great shows for you coming up before the end of the year, as we round out this season of The Data Stack Show. We’ll catch you on the next one. We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.