The Data Stack Show

with Kostas Pardalis & Eric Dodds

Conversations at the intersection of data engineering and business

About The Show

Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

To keep up to date with our future episodes, subscribe to our podcast on Apple, Spotify, Google, or the player of your choice.

Powered by RudderStack

The Data Stack Show is made possible by RudderStack, the complete customer data stack. Created by engineers for engineers, RudderStack makes it easy to build customer data pipelines that connect your whole stack, then make them smarter by ingesting and activating enriched data from your warehouse.

Podcast Episodes

40: Graph Processing on Snowflake for Customer Behavioral Analytics

In this week’s episode of The Data Stack Show, Eric and Kostas talk with the co-founders of Affinio, Tim Burke and Stephen Hankinson. Affinio’s core intellectual property is a custom-built graph analytics engine that can now be ported directly into Snowflake in a privacy-first format.

Highlights from this week’s episode include:

  • Launching Affinio and the engineering backgrounds of the co-founders (2:36)
  • The massive transformation in customer data privacy regulation in the past eight years (6:23)
  • Creating the underpinning technology that can apply to any customer behavioral data set (10:05)
  • Ranking and scoring surfing patterns and sorting nodes and edges (14:13)
  • Placing the importance of attributes into a simple UI experience (19:28)
  • Going from a columnar database to a graph processing system (25:20)
  • Working with custom or atypical data (32:46)
  • The decision to work with Snowflake (37:43)
  • Next steps for utilizing third-party tools within Snowflake (52:18)

The Data Stack Show is a weekly podcast powered by RudderStack. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription

Eric Dodds 00:06

Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.

Eric Dodds 00:27

Welcome back to The Data Stack Show. Really interesting guests today. We have Tim and Stephen from a company called Affinio. And here’s a little teaser for the conversation. They run in Snowflake, they have a direct connection with Snowflake, but they do really interesting marketing and consumer data analytics, both for social and for first-party data, using Graph, which is just a really interesting concept in general. And I think one of my big questions, Kostas, is about the third-party ecosystem that is being built around Snowflake. And I think that’s something that is going to be really, really big in the next couple of years. There are already some major players there. And we see some enterprises doing some interesting things there. But in terms of mass adoption, I think a lot of people are still trying to just get their warehouse implementation into a good place and unify their data. So I want to ask about that from someone who is playing in that third-party Snowflake ecosystem. How about you? What are you interested in?

Kostas Pardalis 01:32

Yeah, Eric, I think this conversation is going to have a lot of Snowflake in it. One thing is what you’re talking about, which has to do more with the ecosystem around the data platforms like Snowflake. But the other and the more technical side of things is how you can implement these sophisticated algorithms around graph analytics on top of a columnar database like Snowflake. So yeah, I think both from a technical and business perspective, we are going to have a lot of questions around how Affinio is built on top of Snowflake. And I think this is going to be super interesting.

Eric Dodds 02:07

Cool. Well, let’s dive in. Tim and Stephen, welcome to The Data Stack Show. We’re really excited to chat about data, warehouses, and personally, I’m excited to chat about some marketing stuff, because I know you play in that space. So thanks for joining us.

Tim Burke 02:21

We’re excited to be here. Thanks for having us.

Eric Dodds 02:23

We’d love to get a little bit of background on each of you. And just a high level overview of what Affinio, your company, does for our audience. Do you mind just kicking us off with a little intro?

Tim Burke 02:36

Yeah, absolutely. I’d be happy to. It’s a pleasure being on with you guys. Just to give you a quick sense of what Affinio is all about, and a little bit of background: we created Affinio about eight years ago. It started off with a really simple concept. Eight years ago, Stephen and I happened to be running a mobile app B2C company. And instead of looking at social media to see what people were saying about our brand, we started off with a really simple experiment of looking at who else our followers on social media were following. That afternoon, we aggregated that data and saw a compelling opportunity against this interest and affinity graph that nobody seemed to be using for advertising and marketing applications. We thought it was just a huge opportunity. So we doubled down and created what continues to be our core intellectual property, which is a custom-built graph analytics engine under the hood. Over those eight years, we basically leveraged that engine to analyze social data as a starting point. But more and more, many of our enterprise customers got really excited about what they could unlock, both insights and actionability, from the data we were providing them with, as well as from using our technology. So over the last two years, we made a conscious effort to double down and start porting a lot of that core graph technology directly into Snowflake. Most recently, and we’re just about to announce this, we’re releasing four of our apps inside the Snowflake marketplace that enable organizations to use our graph technology directly on their data, without us ever seeing the analytics and without us ever seeing the output. So it’s in a completely private format, all leveraging the secure function capability in Snowflake and the data sharing capability. So super excited to be here. We’re obviously huge fans of both Snowflake and warehouse-first approaches, and we think the opportunity between Affinio and RudderStack is a great complement.

Eric Dodds 04:38

Very cool. And Tim, do you want to just give a quick 30 second or one-minute background on you personally?

Tim Burke 04:43

Yeah, certainly. I’m Tim Burke, CEO of Affinio. My background is actually in mechanical engineering. Stephen, who’s on the show, is my CTO and co-founder. We’ve been working together for 12 years now. We’re both engineers by trade; he’s electrical, I’m mechanical. I do a lot of the biz dev and sales work within Affinio, and obviously, from my position, a lot of customer-facing activities. And I’ll let Stephen introduce himself.

Stephen Hankinson 05:08

I’m Stephen Hankinson, CTO at Affinio. Like Tim said, I’m an electrical engineer. But I’ve been writing code since I was about 12 years old, and I just really enjoy working with large data, big data, and solving hard problems.

Eric Dodds 05:21

Very cool. Well, so many things to talk about, especially Snowflake and combining data sets. And that’s just a fascinating topic in general. But one thing that I think would be really interesting for some context, so Affinio started out, providing graphs in the context of social. And one thing I’d love to know. So you started eight years ago, and the social landscape, especially around privacy and data availability, etc, has changed drastically. And so just out of pure curiosity, I’m interested to know what were the kinds of questions that your customers wanted to answer eight years ago, when you introduced this idea? And then how has the landscape impacted the social side of things? I know, you’re far beyond that, but you have a unique perspective in dealing with social media data over a period where there’s just been a lot of change, even from a regulatory standpoint?

Tim Burke 06:23

Absolutely. I would say you nailed it on the head. It’s been a transformational period for customer data privacy, and social data as a whole has probably been one of the most impacted areas. So we’ve definitely seen a massive transition. A lot of that transition over the last few years is partially a change in our focus for that exact reason: recognizing that deprecations in public APIs and the privacy aspects of data availability across social have changed drastically. We’ve been at the front of the line watching all of this happen in real time. But the customers, at the end of the day, are still trying to solve the same problem: how do I understand and learn more about my customers so that I can serve them better, provide a better customer experience, and find more of my high-value customers? Net-net, I don’t think the challenge has changed. I think the data assets those customers are actually leveraging to find those answers are going to change and have been changing. Our legacy social products were largely about deeper understanding and rich interest profiling of large social audiences; that’s kind of where we got started. And that’s obviously one of the most valuable insights for a marketer, because when you understand the scope and depth of your audience’s interest patterns, you can leverage that for where to reach them, how to reach them, how to personalize content, and knowing what offers they’re going to want to click through to. I don’t think that’s actually changed. What people are recognizing more than anything, and obviously you guys would see this firsthand as well, is that many of those data assets that organizations were once willing to have vendors collect or own on their behalf, that has changed drastically. Now it’s basically requiring these enterprises and organizations to own those data assets and be able to do more with them. So what we’re seeing firsthand is that the market has come around to recognizing the need to collect a lot of first-party data. Many organizations have put a lot of effort, energy, and resources behind creating that opportunity within the enterprise. But quite honestly, what we see is a lack of ability to derive meaningful insight and actionability from those large datasets they’re creating. So that’s what our focus is on: enabling the enterprise to unlock at-scale applications no differently than what we’ve done previously on massive social data assets, but this time on their first-party data, and natively inside Snowflake in a privacy-first format.

Eric Dodds 09:27

Super interesting. And just one more follow up question to that. I’m at risk of nerding out here and stealing the microphone from Kostas for a long period which I’ve been known to do in the past. But in terms of graph, was the transition from third-party social data to accomplishing similar things on your first-party data on Snowflake? Was that a big technological transition? I’d just love to know from under the hood standpoint. How did that work? Because the data sets have similarities and differences.

Tim Burke 10:05

No, it’s a great point. For those not familiar with graph technology, the foundation of traditional graph databases is transforming relational data into nodes and edges and analyzing the connectivity in a data asset. Our underpinning technology, which Stephen created firsthand, is this custom-built graph technology that analyzes data based on that premise: everything’s a node, everything’s an edge. At the primitive level, it enables us to ingest and analyze any format of customer data without having to make drastic changes to the underpinning technology. What I would highlight is that the most compelling data assets we can analyze, and the most compelling insights you can gather, are typically driven by customer behavioral patterns. Unlike traditional demographic data, which has its utility, and always has in marketing and advertising applications, I would argue that demographics has traditionally been used as a proxy for a behavioral pattern. What we see, and the opportunity we see to unlock, is that if you’re able to uncover patterns inside of raw customer behavior, what you as a marketer or an advertiser ultimately want to do is change or enhance that behavior. So instead of using demographics as a way to slice and dice data and create audiences, which are ultimately a surrogate for the underpinning behavior you’re looking to change, across these massive data sets being pulled into Snowflake and aggregated in Snowflake, when you start to analyze those behaviors at the raw level and unlock patterns across massive numbers of consumers, you can start acting on that, and leveraging those insights for advertising, personalization, targeted campaigns, and next best offer in a format that is driven by you unlocking that behavioral pattern.

Tim Burke 12:15

So for us, when I speak of customer behavioral patterns, everything that relates to transactional data, content consumption patterns, search behavior, click data, clickstream data, all of those become signals of intent and interest, and ultimately are a rudimentary behavior. We can ingest and transform that data into a graph inside of Snowflake, analyze those connections and similarity patterns across those behaviors natively in the data warehouse, and in doing so create audiences around common interest patterns and look-alikes, and build propensity models off those behaviors. So I wouldn’t understate the transformation, and Stephen obviously put a lot of time into it. It was more that we had initially architected the underpinning technology for the purpose of a certain data set, and what we unlocked and identified was that there was a host of first-party data applications we could apply this tech to. That was the initial aha moment for us in terms of moving it into a Snowflake instance and a Snowflake capability, so that we could basically apply it to any customer behavioral data set.

Kostas Pardalis 13:31

That’s super interesting. I have a question. Stephen might have a lot to say about that. But you’re talking a lot about graph analysis that you’re doing. Can you explain to us and to our audience a little bit, how graphs can be utilized to do analysis around the behavior of a person or, in general, the data that you’re usually working with? Because from what I understand, like the story behind Affinio, when you started, right, you were doing analytics around social graphs, right, where the graph is like a very natural kind of data structure to use there. But how can this be extended to other products and to other use cases?

Stephen Hankinson 14:13

Yeah, I’d say one example of that would be surfing patterns, like Tim had mentioned. Essentially, we can get a data set of sites that people have visited, and even keywords on those sites and other attributes related to the sites, and the times that they visit them. We can put that all together into a graph of people traversing the web, and then we’re able to use some of our scoring algorithms on top of that. So essentially, we rank and score those surfing patterns so that we can put together people or users that look similar into a separate segment or audience. Then we can pop that up and show analytics on top of it, so people can get an idea of what that group of people enjoy visiting online, or where they go, or what types of keywords they’re looking at online, based on the data set we’re working with. I guess that would be one example of a graph that’s not social.
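
Affinio’s actual scoring algorithms aren’t described in detail here, but as a rough illustration of the idea Stephen sketches, the toy Python below (with made-up visit data and a simple Jaccard similarity threshold, both assumptions for the example) builds a user-to-site graph from clickstream rows and groups users whose surfing patterns look alike.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical clickstream rows: (user_id, site). In practice these would come
# from a warehouse table of page-view events.
visits = [
    ("u1", "espn.com"), ("u1", "nba.com"), ("u1", "nytimes.com"),
    ("u2", "espn.com"), ("u2", "nba.com"), ("u2", "bleacherreport.com"),
    ("u3", "vogue.com"), ("u3", "nytimes.com"), ("u3", "pinterest.com"),
    ("u4", "vogue.com"), ("u4", "pinterest.com"), ("u4", "etsy.com"),
]

# One side of the bipartite graph: user -> set of sites visited (the edges).
sites_by_user = defaultdict(set)
for user, site in visits:
    sites_by_user[user].add(site)

def jaccard(a, b):
    """Similarity of two users' surfing patterns: shared sites / all sites."""
    return len(a & b) / len(a | b)

# Score every pair of users and keep the pairs that look alike.
SIMILARITY_THRESHOLD = 0.3  # arbitrary cut-off for this toy example
similar_pairs = [
    (u, v)
    for u, v in combinations(sites_by_user, 2)
    if jaccard(sites_by_user[u], sites_by_user[v]) >= SIMILARITY_THRESHOLD
]

# Merge similar users into segments (connected groups of the similarity graph).
segments = []
for u, v in similar_pairs:
    for seg in segments:
        if u in seg or v in seg:
            seg.update({u, v})
            break
    else:
        segments.append({u, v})

print(segments)  # e.g. [{'u1', 'u2'}, {'u3', 'u4'}]
```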

Tim Burke 15:11

And I’d just pick up on that, Kostas. As Stephen alluded to, we work at the lowest level of the signals being collected. To liken it to a social graph: there you have a follower pattern, which defines and creates the social graph. What we’re doing is taking those common behaviors as the nodes and edges. So whether it be sites that people visit, similar content that they’re consuming, or transactional history that looks similar to one another, the application is effectively just how we transform, to your point, those individual events into a large customer graph on first-party data within the warehouse. And then, like I said, from there the analytics and applications are very, very similar, regardless of whether you’re analyzing a social graph, a transactional graph, or a web surfing graph. It ultimately comes down to what your definitions are for those nodes and edges at the core.

Kostas Pardalis 16:16

Yeah. And what’s the added value of trying to describe and represent this problem as a graph instead of like, I don’t know, like more traditional analytical techniques that people are using so far?

Tim Burke 16:30

For us, it comes down to segmentation, specifically. At the core of what advertisers and marketers do on a daily basis is cutting, slicing, and dicing data, and oftentimes that is restricted to a single event: find me the customers that bought product X, find me the customers that viewed TV show Y. Oftentimes the analytics capability is restricted to the scope of that small segment. What we’re doing is taking that segment and looking across all of their behaviors, beyond that initial defined audience segment. By compiling all those attributes simultaneously inside of Snowflake, we’re actually able to uncover the other affinities beyond it. So besides watching TV show X, what are the other shows that audience is over-indexing on or has high affinity for? Besides buying product Y, what other products are they buying? Those signals, from a marketer’s perspective, start to unlock everything from recommendation engines and next best offer to net new personalized customer experience recommendations, in terms of recognizing that this group as a whole has these patterns.

Tim Burke 17:47

And that’s at the core of it. You can certainly achieve that in a traditional relational database if you have two, three, ten attributes per ID. But when you start going to the scale we’re analyzing with our technology inside of Snowflake, you’re talking about potentially hundreds of millions of IDs against tens of thousands to hundreds of thousands of attributes. So when you actually try to surface what makes this segment tick and what makes them unique, trying to resolve that and identify the top-10 attributes of high affinity to that audience segment is extremely complex in a relational format. Using graph technology, the benefit is that it can be calculated in a matter of seconds inside the warehouse, so that marketers and advertisers can unlock those over-indexing, high-affinity signals beyond the audience definition they first applied. And that helps with everything, like I said, from understanding the customer all the way through to things like next best offer and media platforms of high interest.
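
To make “over-indexing” concrete: a common way to score it (not necessarily Affinio’s exact formula, which isn’t spelled out here) is a lift ratio, comparing how common an attribute is inside the segment versus in the overall population. A minimal Python sketch with made-up data:

```python
from collections import Counter

# Hypothetical attribute observations: user_id -> set of attributes
# (shows watched, products bought, sites visited, ...).
attributes_by_user = {
    "u1": {"show:X", "show:Z", "product:shoes"},
    "u2": {"show:X", "product:shoes", "product:socks"},
    "u3": {"show:Y", "product:laptop"},
    "u4": {"show:Y", "show:Z", "product:laptop"},
}

def over_index_scores(segment_users, all_users, top_n=10):
    """Rank attributes by lift: share inside the segment / share in the population."""
    seg_counts = Counter(a for u in segment_users for a in attributes_by_user[u])
    pop_counts = Counter(a for u in all_users for a in attributes_by_user[u])
    scores = {}
    for attr, seg_n in seg_counts.items():
        seg_share = seg_n / len(segment_users)
        pop_share = pop_counts[attr] / len(all_users)
        scores[attr] = seg_share / pop_share  # > 1.0 means the segment over-indexes
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# "Audience that watched show X" -> what else do they over-index on?
segment = {"u1", "u2"}
print(over_index_scores(segment, set(attributes_by_user)))
```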

Kostas Pardalis 18:53

Right. That’s super, super exciting for me. I have a question that’s more of a product-related question, not a technical one: how do you expose this kind of structure to your end user, who from what I understand is the marketeer? I would assume that most marketeers don’t really think in terms of graphs, or it’s probably something a little more abstract in their heads. Can you explain how you manage to expose all the expressivity that the graph offers for this particular problem to a non-technical person like a marketeer?

Tim Burke 19:28

Yeah, it’s a great question. A lot of what we created eight years ago, and even the momentum on our social application eight years ago, was the simplicity of identifying those over-indexing signals, the ability to do unsupervised clustering on those underpinning behaviors to unlock what I would call data-driven personas. So we put a lot of energy into restricting how much data you surface to your end user and simplifying it based on their objective. A key element to that, within the framework of these applications we built inside Snowflake, is that our end user does not actually get exposed to the underpinning graph-based transformation and all the magic happening inside of Snowflake. What they do get exposed to, and what our algorithms are able to do, is surface in rank order the importance of those attributes and place those into a simple UI experience. The benefit at the end of the day is that, because all these analytics are running natively inside Snowflake, any application that has a direct connector to Snowflake can query and pull back these aggregate insights. Think of a standard BI application with a standard connector into Snowflake: with very little effort, it can leverage the intelligence we’ve built inside of Snowflake and pull forward, based on an audience segment definition, the over-indexing affinities in rank order for that particular population. So I think you nailed the challenge: for many in the marketing field, graph technology is not one of their primary backgrounds, and if you ask them how they would use a standard graph database, that’s not something most people are thinking about. What they are thinking hard about, again, are these simple questions: what are the things that make an audience segment unique, make them tick, make them behave the way they behave? Unless you approach that problem statement with a graph-based technology under the hood, it’s extremely complicated and challenging. Many organizations we work with talk about the fact that what we’re unlocking inside the warehouse in a matter of seconds would traditionally have taken a data science team or an analyst team days, if not weeks, to unlock. So for us, it becomes scalability; it’s the repeatability of these types of questions that guys like Eric, I’m sure, live and breathe every day: what makes an audience tick? Whether that is, of the people who churn, what are the over-indexing signals so we can plug those holes in the product, or of the high-value customers, what makes their behavior on our platform unique. Those are the things we’re trying to unlock and uncover for a non-technical end user, because that is their daily activity; they have to crack that nut on a daily basis in order to achieve their KPIs. That’s what we’re most excited about. When Stephen and I started eight years ago, graph technology, certainly as it pertained to marketing applications, was really still very new, and I would say it’s still very nascent.
But I think it’s coming of age, because as the data assets inside of things like Snowflake’s data warehouse grow, unless you can analyze across the entire breadth of that data asset and unlock, in an automated way, these key signals that make up an audience, the challenge will always be the same. And the challenge is going to get worse, because we’re not making datasets smaller, we’re making them larger, so the complexity and challenge associated with that only increases with time. For us, that’s what we’re trying to trivialize: there are repeatable requests made of a marketing analyst, a marketing team, an advertiser, a media buyer, and predominantly they’re affinity-based questions, whether people recognize or ask them as such. Of the person who just signed up on our landing page, what should we offer them? What are their signals? What kind of signals should influence what we recommend to them, how we manage them, how we manage the customer experience, how we personalize content? Those types of questions, which we see on a daily basis, are what marketing teams are trying to address, many of whom don’t have direct access to the raw data. And that’s why a lot of our technology natively inside of Snowflake is unlocking the ability for them to do that in aggregate, without ever being exposed to private or row-level data.

Kostas Pardalis 24:21

That’s amazing. I think that’s one of the reasons I really love working with these kinds of problems, and engineering in general: this connection of something so abstract, like a graph, to a real-life problem, like something a marketeer is doing every day. I think that’s a big part of the beauty of doing computer engineering, and I really enjoy that. But I have a more technical question now. We talked about how we can use these tools to deliver value to the marketing community. So how did you go from a columnar database system like Snowflake to a graph processing system? How did you bridge these two different data structures? On one side, you have more of a tabular or columnar way of representing the data, and on the other hand, you have something like a graph. So how do these two things work together?

Stephen Hankinson 25:20

Yeah, so basically what we end up doing is we have some secure functions in our own account that we share over to the customer. That gives them a shared database which includes a bunch of secure functions that we’ve developed. Then we work with the customer to give them predetermined functions, or queries, that they run on top of their data, based on the structure of their tables. The queries that we give them essentially pass their raw data through our encoders and output that new data into a new table. That really just looks like a bunch of garbage if you look at it in Snowflake; it’s mostly binary data, but it’s a probabilistic data structure that we store our data in. With that probabilistic data structure, they can then use our other secure functions, which are able to analyze that graph-based data and output all the insights that Tim was mentioning before. Essentially, you just feed in a defined audience that you want to analyze, the secure function runs all the processing on top of that probabilistic data structure, and it outputs all the top attributes and scores for the audience they’re analyzing.

Kostas Pardalis 26:38

Oh, that’s super interesting. Stephen, can you share a little bit more information about this probabilistic data structure?

Stephen Hankinson 26:45

Yeah, it’s essentially putting it in a privacy-safe format. We feed in all the IDs with the different attributes that they want to be able to query against, and use some hashing techniques to compute this new structure that can be bumped up against other encoded data sets of the same format. Once you mash them together, you can use some of the algorithms in our secure function library, and from there you get things like intersections, overlaps, and unions of all kinds of sets. It’s basically doing a bunch of set theory on these different data structures in a privacy-safe way.
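
The specifics of Affinio’s encoding aren’t spelled out in the conversation, but one well-known family of hash-based set sketches that supports the operations Stephen names (estimating intersections, overlaps, and unions between encoded sets without exposing the raw IDs, at a fraction of the storage) is MinHash. A minimal Python sketch of that idea, purely as an illustrative analogy rather than Affinio’s actual structure:

```python
import hashlib

NUM_HASHES = 128  # more hash functions -> better overlap estimates, bigger sketch

def _hash(value, seed):
    """Deterministic 64-bit hash of a value under a given seed."""
    digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def minhash(ids):
    """Encode a set of IDs as a fixed-size signature (the 'probabilistic' sketch)."""
    return [min(_hash(i, seed) for i in ids) for seed in range(NUM_HASHES)]

def estimated_jaccard(sig_a, sig_b):
    """Estimate overlap between two encoded sets without touching the raw IDs."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / NUM_HASHES

# Two hypothetical audiences, encoded once and then compared many times.
audience_a = minhash({f"user{i}" for i in range(0, 1000)})
audience_b = minhash({f"user{i}" for i in range(500, 1500)})
print(estimated_jaccard(audience_a, audience_b))  # roughly 1/3 (500 shared of 1500 total)
```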

Kostas Pardalis 27:29

Yeah, that’s super interesting. And there’s, I mean, there’s a big family of database systems, which is actually graph databases, right? So from your perspective, why is it better to implement something like what you described, like, compared to getting the data from a table like on Snowflake, and fitting it into a more traditional, let’s say, kind of graph system.

Stephen Hankinson 27:55

I think that is the main benefit of doing it this way, because they don’t need to make a copy of their data and they don’t need to move their data, it simply stays in one place.

Tim Burke 28:03

Yeah. And I would just add to that, Kostas. When we speak of the benefits of Snowflake’s underpinning architecture and the concept of not moving data: what we’re not trying to do is replicate all the functionality of a graph database. There are obviously applications for which it is absolutely suitable and reasonable to make an entire copy of the data set and run that type of analytics. What we’re trying to do is take the marketing and advertising applications and productize them in a format that does not require that, that still leaves the data where it is inside of Snowflake, and that provides this level of anonymization. I would also highlight that Stephen’s code that does the encoding of that new data structure achieves about a five-to-one data compression, which supports more queries for the same price for this affinity-based querying structure.

Kostas Pardalis 29:03

Yeah, that’s very interesting. This discussion we’re having, comparing a more niche kind of system around graph processing with a general kind of graph database, reminds me a little of something that happens here at RudderStack from an engineering point of view. We built a part of our infrastructure that needed some capability similar to what Kafka offers, but instead of incorporating Kafka into our system, we decided to build part of that functionality over Postgres, in a system that’s tailor-made for exactly our needs. I think finding this trade-off between a generic system and something that is tailor-made for your needs is what makes engineering as a discipline super important. At the end, this is the essence of making engineering choices when building complex systems like this: trying to figure out when we should use something more generic as a solution, and when we should take a subset of it and make it tailor-made for the problem we’re trying to solve. And that’s extremely interesting; I love that I hear this from you. We had another episode in the past with someone from Neo4j, and we discussed this, because if you think about it, a graph database at the end is a specialized version of a database, right? At the end, a database system like Postgres can replicate the exact same functionality that a graph database system can, but by focusing on a very narrowly defined problem, you can do it even better. And that’s what you’ve done, and I find a lot of beauty behind this. So this is great to hear from you guys.

Tim Burke 30:48

I think it’s also interesting, just picking up on that, in terms of the decision around when you optimize versus leave it generic. You can see it in the market: there are machine learning platforms that can host a whole range of different models used for a whole range of different things, the Swiss Army knife application within an organization. When those custom requests come in from teams, those types of platforms absolutely make a lot of sense, because your data science team has to go in, and it’s probably a custom model answering a custom question. For us specifically, when it comes time to actually build an optimized solution, something that can be deployed natively inside Snowflake, it comes down to repeatability and efficiency. When the same requests are being made hundreds of times a year, should you actually be building a custom model every time, or should you push that workload into the warehouse? For us, that’s been a specific focus: those applications and requests where the marketer can self-serve and get the answers they need in seconds, as opposed to putting it on the data science team’s backlog. Those are the applications we’re focused on pushing in and optimizing.

Kostas Pardalis 32:09

Yeah, yeah, I totally agree. So, one last more technical question from my side: you mentioned that the way the system works right now is that you take the raw data that someone has stored in Snowflake, and you have some kind of encoders or transformers that transform this data into the probabilistic data structures that you have. Do you have any kind of limitations in terms of what data you work with? Do you have requirements in terms of the schema that this data should have, and what’s the pre-processing that the user has to do in order to utilize the power of your system?

Stephen Hankinson 32:46

So if it’s essentially rectangular-form data, it’s pretty easy to ingest into the encoder; we have a UI that will do that for you. But if there are some weird things about the data that wouldn’t be typical, we can actually work with them. If they give us an example of what the data looks like, we can craft an encoding query for them; they just feed everything through, and it will still go into our encoder the right way and end up in the central probabilistic graph format that we use. So we haven’t currently run into any dataset that we haven’t been able to encode. It seems to be pretty generic at this point.

Kostas Pardalis 33:28

And is this process something that the marketeer is doing, or is there some need for support from the IT or engineering side of the company?

Stephen Hankinson 33:36

We usually work with IT at that stage. And then once it’s encoded, the UI works with the data that’s already been encoded. They can also set up tasks inside of Snowflake which will update that data set over time, to add new records or delete data as it comes in. But yeah, that is not done by the marketeer.

Kostas Pardalis 33:56

All right. And is Affinio right now offered only through Snowflake? Is there like a hard requirement that someone needs to have their data on Snowflake to use the platform?

Tim Burke 34:06

It is currently, Kostas. We obviously went through an exercise evaluating which platform to build on first. For us, it came down to two fundamental capabilities within Snowflake, or probably three. The first is the secure functions that we’re utilizing to protect our IP in the applications that we share over. The second, the direct data sharing capability, was fundamental to that decision. And the third for us is the cross-cloud capability, the ability to ingest data across all three clouds. As a starting point, that was our buy-in, recognizing that the momentum Snowflake has, specifically in the media and entertainment, retail, and advertising space, is and continues to be a good fit for our applications at this stage. We’ve had discussions more broadly about whether we could replicate this for specific cloud applications, but where we are right now, in terms of early market traction, our bet is on Snowflake and the momentum that they currently have.

Kostas Pardalis 35:12

And Tim, you mentioned, I think earlier that your product is offered through the marketplace that Snowflake has. Can you share a little bit more about the experience that you have with the marketplace? How important is it for your business? And why?

Tim Burke 35:28

Yeah, so I think the marketplace is still in its early stages, even with as many data partners as have already bought in. For us, one of the clear challenges we face is that we are not data providers. So I think we’re slightly nuanced within the framework of what has traditionally been built up from a data marketplace, or data asset, perspective. We were positioned inside the marketplace deliberately and consciously with Snowflake, because our applications drive a lot of the data sharing functionality and add capabilities on top of that data marketplace: people can apply first-, second-, and third-party data assets inside of Snowflake and run our type of analytics on top of them. So for us, it’s been unique in the sense of being positioned almost as a service provider inside what is otherwise currently positioned as a data marketplace. But I think over time you’ll start to see that bifurcate within Snowflake, and you’ll get a separation and a unique marketplace driven by service providers like ourselves, alongside straight data providers.

Tim Burke 36:41

So I think it’s early stages. What we’re excited about is that we see a lot of our technology as being an accelerant to many of those data providers directly. Many of the ones we’ve already started working with see it as a value proposition and a value-add to the raw data asset they may be sharing through Snowflake, a means to get more value from that data asset on the customer’s behalf by applying our application or technology in their Snowflake instance.

Kostas Pardalis 37:11

This is great. Tim, you mentioned a few things about your decision to go with Snowflake. Can you share a little more about that, and more specifically, what would be needed for you to consider going to another cloud data warehouse, something like BigQuery or, I don’t know, Redshift? What is offered right now by Snowflake that gives tremendous value to you and makes you prefer, at this point, to build only on Snowflake?

Tim Burke 37:43

Yeah, I think if we stood back and looked at where Stephen and I started in terms of porting our graph technology into first-party data, much of that was centered on applications and analytics specific to an enterprise’s own first-party data only. If it were only restricted to that model, I think we would have considered more broadly doing it directly inside any or all of the cloud infrastructures or cloud-based systems to begin with. But I would say that ours is a combination of the ability to do analytics directly on first-party data as well as, as Stephen indicated, a major component of our technology that we’ve created inside of Snowflake that unlocks this privacy-safe data collaboration across the ecosystem. So the criteria in terms of selecting Snowflake were, again, the ability to leverage secure UDFs, secure functions, to lock down and protect the IP that we’re sharing into those instances. The second major component is the second half of our IP, which is effectively this privacy-safe data collaboration, powered by the underpinning data sharing capability of Snowflake. So if and when we’re evaluating other providers in the context of where we port this next, that’s the lens we look through: can we unlock the entire capability, this privacy-safe data collaboration and analytics capability, in a similar way to how we’ve done it on Snowflake? Because to me, that’s the primary reason why we picked that platform.

Kostas Pardalis 39:34

Yep. And one last question for me, and then I’ll leave it to Eric. It’s a question for both of you guys, just from a different perspective. You’ve been around quite a while. I mean, Affinio, as you said, started eight years ago, which was pretty much at the same time that Snowflake also started. So you’ve seen a lot of the cloud data warehouse and its evolution. How have things changed in these past eight years, both from a business perspective, and this is probably more of a question for you, Tim, and also from a technical perspective? How has the landscape changed?

Tim Burke 40:08

I think the point you’re making is absolutely interesting. I first learned of Snowflake directly from customers of ours who, at the time, were coming to us with a specific request. It was very simple. They said, we love what you’re doing with our social data; we would love it natively in Snowflake. That was honestly the first time we had learned of that application, many, many years ago. But what I would say is that, as far as the data warehouse has advanced from a technical perspective, I think it still belongs, or certainly has its stronghold, in the CDO, CIO, and CTO offices within many of these enterprises. What I expect to see, and what I think we’re helping drive and pioneer with what we’ve built for marketing and advertising, is that the value of the assets stored inside the data warehouse has to become more broadly applicable and accessible across the organization, beyond what has traditionally been locked away with high-infosec data science teams, because the value that needs to be tapped across global enterprises cannot funnel through just a single team all the time. What we will see, and I think are starting to see in the early stages, is awareness by other departments inside the enterprise of even where their data is stored, quite honestly. There are still conversations we’re having with many organizations in the marketing realm who have no idea where their data is stored. So I think familiarity and comfort with that data asset, how to access it, what they can now access, and how they can utilize it, will become the future of where the data warehouse is going. But I think we’re still a long way from there; there’s still a lot of education to do. But we’re excited about that opportunity, specifically from the business perspective.

Stephen Hankinson 42:03

And on the tech side of things, I would say the biggest changes are probably around privacy, where you have to be a lot more privacy-aware and secure. Working with Snowflake makes that a lot easier for us, with the secure sharing of code and secure shares of data as well. So using that, with our code embedded directly into those shares, customers using this can be sure that their data is secure, and even if they’re sharing data over to other customers, it’s secure to do that as well.

Kostas Pardalis 42:35

This is great, guys. So Eric, it’s all yours.

Eric Dodds 42:40

We’re closing in on time here, but I do have a question that I’ve been thinking about really since the beginning, and it’s taking a step back. Kostas asked some great questions about “why Snowflake?” and some of the details there. Stepping back a little bit, I would love your perspective on what I’ll call, for the purposes of this episode, the next phase of data warehouse utilization, and I’ll explain what I mean. A lot of times on the show, we’ll talk about major phases that technology goes through. And in the world of data warehouses, they’re actually not that old. You have Redshift being the major player fairly early on, and then Snowflake hit general availability, I think, in 2014, but even then they were still certainly not as widespread as they are now. The way we describe it is that we’re currently living in the phase where everyone’s trying to put a warehouse in the center of their stack, collect all of their data, and do the things the marketing analytics tools have talked about for a long time, like getting a complete view of the customer, and everyone realized, okay, I need a data warehouse in order to actually do that. And that’s a lot of work. So we’re in the phase where people are getting all of their data into the warehouse, which is easier than ever, and doing really cool things on top of it. But I would describe Affinio in many ways as almost being part of the next phase, and Snowflake is particularly interesting here: let’s say you collect all of your data; now you can combine it with all sorts of other things natively, which is an entirely new world. There are all sorts of interesting data sets in the Snowflake marketplace, etc. But most of the conversation and most of the content out there is just about how you get the most value out of your warehouse by collecting all your data and doing interesting things on top of it. So I’d just love your perspective: do you see the same major phases? Are we right that we’re in the phase where people are still trying to collect their data and do interesting things with it? And then give us a peek, as a player who’s part of the marketplace and the third-party connections but able to operationalize natively inside the warehouse, at what that is going to look like. I mean, marketing is an obvious use case, but I think in the next five years that’s going to be a major, major movement in the world of the warehouse. Sorry, that was long-winded, but that’s what’s been going through my mind.

Tim Burke 45:23

Totally. It is the stuff that we think about and talk about on a daily basis, and I think you’re right. The world has already woken up to the fact that gathering, collecting, owning, and managing all customer data in one location is going to be critical in the future. I would say COVID has woken the world up to that; as many of us have heard and seen, there is no better driver for digital transformation than a pandemic. But at the same time, I completely agree with you. What I think personally, given what we’re creating within these native applications inside of Snowflake, is that you will start to see an emergence of privacy-safe SaaS applications deployed natively inside the warehouse. You will see, literally, a transformation of how SaaS solutions are deployed. Organizations like Affinio, who have traditionally hosted data on behalf of customers and provided web-based logins to access data stored by the vendor, will, I think, continue a movement where the IP, core capabilities, and technologies of these vendors start to port natively into Snowflake. I believe that Snowflake itself will actually start to find ways to attribute the compute and value that vendors like ourselves, and the applications we’re driving inside the warehouse, generate, and I think you’ll see that naturally extend into rev-share models, where, as an enterprise, you sign on to Snowflake and have all these native app options you can turn on automatically that allow you not only to reap more benefits, but to get up to speed and make your data more valuable faster. Honestly, Stephen and I have talked about this for some time now; we see that in the next 10 years there’ll be a transition. It probably won’t eliminate the old model, but you’ll see a new set of vendors that will start building in a native application format right out of the gate, and that, I think, will transform the traditional SaaS landscape.

Eric Dodds 46:13

Yeah, absolutely. And a follow-on to that. When you think about data in the warehouse, you can look at it from two angles. The warehouse is really incredible because it can support data that conforms to any business model, B2C, B2B, etc. It’s agnostic to that, which makes it fully customizable; you can set it up to suit the needs of your business. So in some sense, everyone’s data warehouse is heavily customized. When you look at it from the other angle, though, from the perspective of third-party data sets, and something Kostas and I talk a lot about, which is common schemas or common data paradigms: if you look across the world of business, you have things like Salesforce. Salesforce can be customized, but you have known hierarchies: leads, contacts, accounts, etc. Do you think that the standardization of those things, or the market penetration of known data hierarchies and known schemas, will help drive that? Or is everyone just customizing their data, so that won’t really play a role?

Tim Burke 49:01

You know, that’s a great question. It’s a conversation we’ve had with other vendors and with many of our customers, relative to what they perceive as beneficial in many CDPs in the market. To your point, Eric, fixed taxonomies and schemas enable an app ecosystem and partner ecosystem to build easily on top of that schema. Completely. I would say it’s still early to see how that actually comes about. What I would say is that I think you will start seeing organizations adopt, within Snowflake and within their warehouse, many aspects of best-of-breed schemas, because as I see this application space build out, it’s kind of the way that it has to scale, both from a partner and marketplace play as well as the plug-and-play nature of how you want to deploy this at scale. Ultimately, the game plan would be that all these apps run natively, you can turn them on, they already know what the schema is behind the scenes, and they can start running. As Stephen alluded to, at this stage there’s obviously a lot of hand-holding at the front end, until you get those schemas established and encoded into a format that’s queryable, etc. So I think what you’ll start to see is best-of-breed schemas bridging across into Snowflake, that would be my assumption. The more you see people leveraging Snowflake in a build-your-own format, it’s kind of required, right? And I wouldn’t be surprised to see some elements of that adopted across into best-of-class and best-of-breed within Snowflake directly for that purpose.

Eric Dodds 50:47

Sure, yeah. It’s fascinating to think about a world where, today, you have your core set of tooling and core set of data, and you build out your stack by making sure that things can integrate in a way that makes sense for your particular stack, which in many cases requires a lot of research, etc. And it’s really interesting to think about the process of architecting a stack where you just start with a warehouse and make choices based on best-of-breed schemas. At that point, the tooling is heavily abstracted, right? Because you are basically choosing time-to-value in terms of best-of-breed schemas. Super interesting.

Tim Burke 51:37

Yeah, completely.

Eric Dodds 51:39

Alright, well, we’re close to time here. So I’m going to ask one more question. And this is really for our audience, and anyone who might be interested in the Snowflake ecosystem, what’s the best way to get started with exploring third party functionality in Snowflake? I mean, Affinio, obviously, a really cool tool, check it out. But for those who are saying, okay, we’re kind of at the point where we’re unifying data, and we want to think about augmenting it. You know, where do people go? What would you recommend as the best steps in terms of exploring the world of doing stuff inside of Snowflake natively, but with third party tools and third party datasets?

Tim Burke 52:18

I think it all starts, from our perspective, with the conversations we have with prospects and customers around which questions are the repeatable ones you want addressed and answered. In combination with that, a key element of what these types of applications enable from a privacy perspective is the ability for more individuals across the organization to answer those types of questions. So many of the starting points for us ultimately come down to: what are those repeatable questions and repeatable workloads that you’d like to have trivialized and basically plug-and-play inside of the warehouse, and that will speed up what otherwise is oftentimes a three-week wait, a three-week model, or a three-week answer? I think that’s where we start with most of our prospects and discussions, and for those thinking about or contemplating this, that’s a great place to start: recognizing that this isn’t the silver bullet to address all questions or all problems, but for those that are rinse-and-repeat and repeatable, these types of applications are very, very powerful.

Eric Dodds 53:30

Love that. Just thinking back to my consulting days, doing lots of analytics, or even tool choice for the stack: always starting with the question, I think, is just a generally good piece of advice when it comes to data.

Eric Dodds 53:48

Well, this has been a wonderful conversation, Tim and Stephen, really appreciate it. Congrats on your success with Affinio. Really cool tool, so everyone in the audience, check it out. And we’d love to have you back on the show in another six or eight months to see how things are going.

Tim Burke 54:03

Yeah, I would love to.

Stephen Hankinson 54:05

Thanks very much.

Eric Dodds 54:07

As always, a really interesting conversation. I think that one thing that stuck out to me and I may be stealing this takeaway from you Kostas. So I’m sorry. But I thought it was really interesting how they talked about the interaction of graph with your traditional rows and columns warehouse, in the paradigm of nodes and edges. That’s something that’s familiar to us relative to identity resolution in the stuff that we work on and that we’re familiar with. And so kind of breaking down that relationship in terms of nodes and edges, I think was a really helpful way to think about how they interact with Snowflake data.

Kostas Pardalis 54:46

Yeah, yeah, absolutely. I think the part of the conversation where we talked about different types of representation of the data, and how each representation is better suited to specific types of questions, was great. And if there’s something we can get out of this, it’s that there’s a conception of the data that remains the same at the end: it’s the same thing whether you represent it as a graph, as a table, or, in the end, as a set. Because if you noticed in the conversation, at the end they end up representing the graph using some probabilistic data structures that ultimately represent sets, and they perform set operations on those to do their analytics. And that, from a technical perspective, is very interesting. I think this is a big part of what computer engineering and computer science are about, right? How we can transform from one representation to the other, and what kind of expressivity these representations give us, keeping in mind that at the end all of these are equivalent. The types of questions we can answer are the same; it’s not like something new will come out of the different representations. It’s more about the ergonomics of how we ask the questions, how naturally the questions fit these models and structures, and in many cases also about efficiency. And it’s super interesting that all of this is actually built on top of a common infrastructure, which is the data warehouse, in this case Snowflake. That’s a testament to how open a platform Snowflake is. In my mind, the only other system I’ve heard of being so flexible is Postgres, but Postgres, as a database, has existed forever, like 30 years or something. Snowflake is a much, much younger product, but still, they have managed an amazing velocity when it comes to building the product and the technology behind it. And I’m sure that if they keep up the pace, we’ll have many things to say in the near future, both from a technical and a business perspective.
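
To make the table-versus-graph-versus-set idea concrete, here is a minimal sketch in Python. It shows the same data as a table (an edge list), as a graph (adjacency sets), and as plain sets that support overlap queries. The names and data are hypothetical, and real engines typically use probabilistic sketches rather than exact sets; exact sets are used here only to keep the example self-contained.

    from collections import defaultdict

    # "Table" representation: rows of (user, interest) pairs.
    edges = [
        ("alice", "hiking"), ("alice", "espresso"),
        ("bob", "hiking"), ("bob", "cycling"),
        ("carol", "espresso"), ("carol", "cycling"),
    ]

    # "Graph" representation: each node maps to the set of nodes it touches.
    adjacency = defaultdict(set)
    for user, interest in edges:
        adjacency[user].add(interest)

    def jaccard(a: set, b: set) -> float:
        """Set-based overlap score between two nodes' neighbor sets."""
        return len(a & b) / len(a | b) if a or b else 0.0

    print(jaccard(adjacency["alice"], adjacency["bob"]))    # 1/3
    print(jaccard(adjacency["alice"], adjacency["carol"]))  # 1/3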

Eric Dodds 56:55

Great. Well, thank you so much for joining us on the show. And we have more interesting data conversations coming for you every week, and we’ll catch you on the next one.

Eric Dodds 57:08

We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at Eric@datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.

39: Diving deeper into CDC with Ali Hamidi and Taron Foxworth of Meroxa

On this week’s episode of The Data Stack Show, Eric and Kostas have Meroxa back on the show, this time talking with co-founder and CTO Ali Hamidi and developer advocate Taron Foxworth. Together they discuss uses and implementations of change data capture, formulating open CDC guidelines, and debate the use of reverse ETL. 

Highlights from this week’s episode include:

  • Meroxa is a real-time data engineering managed platform (4:53)
  • Use cases for CDC (6:20)
  • Meroxa leverages open source tools to provide initial snapshots and start the CDC stream (12:29)
  • Making the platform publicly available (14:14)
  • What the Meroxa user experience looks like (16:10)
  • Raising Series A funding (17:49)
  • Easiest and most difficult data sources for CDC (20:23)
  • The current state of open CDC (23:16)
  • Expected latency when using CDC (29:56)
  • CDC, reverse ETL, and a focus on real-time (36:39) 
  • Are existing parts of the stack replaced when Meroxa is adopted? (39:45)

The Data Stack Show is a weekly podcast powered by RudderStack. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription

Eric Dodds  00:06

Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.

Eric Dodds  00:23

All right, welcome back. We are doing something on this episode that we love, which is when we talk with companies who we talked with a long time ago, and in the podcast world, a long time ago is maybe six months or so, which for us is about a season. So one of our early conversations in season one was with a company called Meroxa, and they have built some really interesting tooling around CDC. We talked with DeVaris, one of the founders. And today we get to talk with one of the other founders, Ali, and a developer evangelist named Taron Foxworth from Meroxa. And they, I think, recently raised some money, and have built lots of really interesting stuff since we talked with DeVaris. So really excited to have them back on the show. One thing that I’m interested in, and I want to get a little bit practical here, especially for our audience, one of the questions I’m going to ask is, where do you start with implementing CDC in your stack? It’s useful in so many ways. It’s such an interesting technology. But it’s one of those things where you can kind of initially think about it, and you’re like, oh, that’s kind of interesting. But then you see one use case, and you start to think of a bunch of other use cases. And so I want to ask them, where do they see their users and customers start in terms of implementing CDC? And maybe even what does it replace? Kostas, how about you?

Kostas Pardalis  01:49

Yeah. First of all, I want to see what has happened in the almost one year since we spoke with DeVaris. And one year for a startup is like a huge amount of time. And it seems like they are doing pretty well. I mean, as you said, Eric, very recently they raised their Series A. So one of the things that I absolutely want to ask them is what happened in this past year. And also, I think that just a couple of weeks ago they also released their product publicly. So I want to see the difference between now and then. That’s one thing.

Kostas Pardalis  02:23

And the other thing is, of course, we are going to discuss in a much, much more technical depth about CDC. And I’m pretty sure that we are going to have many questions about how it can be implemented, why it is important, what is hard around the implementation, and any other technical information that we can get from Ali.

Eric Dodds  02:43

Let’s jump in and start the conversation. All right, Ali and Fox, welcome to the show. We are so excited to talk with Meroxa again, we had DeVaris on the show six or eight months ago, I think, and so much has happened at Meroxa since then. And we’re just glad to have you here.

Ali Hamidi  03:03

Thanks. Thanks for having us.

Taron Foxworth  03:05

Yeah, I’m so excited to talk with you today.

Eric Dodds  03:07

Okay, we have a lot to cover, because Kostas and I are just sort of fascinated with CDC and the way that it fits into the stack. But before we go there, we talked with DeVaris, one of the founders. Could you just talk a little bit about each of your roles at Meroxa, and maybe a little bit about what you were doing before you came to Meroxa?

Ali Hamidi  03:27

Yeah, so I’m Ali, Ali Hamidi, and I’m the CTO and the other co-founder at Meroxa. Before starting Meroxa with DeVaris, I was a lead engineer at Heroku, part of Salesforce, specifically working on the data team handling Heroku Kafka, which was the managed Kafka offering. But before that, you know, I’ve always been working in and around the data space, and did a ton of work around data engineering in the past.

Taron Foxworth  03:53

And I’ll go next. Hi, everyone. My name is Taron Foxworth. I also go by Fox at Meroxa. I am the head of developer advocacy. I spend most of my time now building material that helps customers understand data engineering and Meroxa itself. I also work a lot with our customers actually understanding how they’re using Meroxa and also trying to learn from them as much as possible. In the past I ran evangelism and education for an IoT platform. That’s kind of where I really jumped into this data conversation because you know, IoT generates a bunch of data, a bunch of sources. Then I joined Meroxa back in February to really dive into this data engineering world. And it’s been such a blast so far.

Eric Dodds  04:35

Very cool. Well, I think starting out it’d be really good. Just to remind our audience, we have many, many new listeners, since the last time we talked with Meroxa. Could you just give us an overview of the platform and a little bit about why CDC?

Ali Hamidi  04:53

Yeah, sure. So Meroxa is essentially a real-time data engineering managed platform. Essentially, it makes it very easy for you to integrate with data sources, pull data from one place, transform it, and then place it into another place in the format that you want. And a big part of that for us is really the focus on CDC, change data capture. You know, it’s been around for a while, but only recently really gained a lot of interest and a lot of attention. And really, the value of CDC is, rather than taking a snapshot of what your source record or database looks like at the time of making that request, CDC gives you the list of changes that are occurring in your database. So for example, if you’re looking at the CDC stream within Postgres, anytime a record is created, updated, or deleted, you’re getting that event, and it basically describes the operation. And so it gives you a really rich view of what exactly is happening on the upstream source, rather than just, okay, this is the end result of what happened. It gives you the story of what happened, and it adds that sort of temporal aspect.
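
To make that concrete, here is a rough sketch of what a single row-level change event might look like. The field names loosely follow the Debezium-style envelope (row images before and after, plus an operation code); the exact shape varies by connector, and the database and table names are made up.

    # Roughly what a row-level change event looks like; illustrative only.
    update_event = {
        "op": "u",                                    # "c" = create, "u" = update, "d" = delete
        "ts_ms": 1620000000000,                       # when the change happened
        "source": {"db": "shop", "table": "carts"},   # where it came from
        "before": {"id": 42, "item_count": 1},        # row image before the update
        "after":  {"id": 42, "item_count": 2},        # row image after the update
    }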

Eric Dodds  06:04

There are so many uses for CDC. I’d love to know is Meroxa focusing on a particular type of use case or particular type of data as you’ve built the platform out?

Ali Hamidi  06:20

Yeah, so I kind of have an answer in two parts to that question. One of the things that led us to focus on CDC is that we were trying to look at the areas where we can add the most value, really apply our expertise and the experience that the team has, and generate the most value for customers. And one of the things we saw is that setting up CDC pipelines and CDC connectors has always been really difficult for customers. Having spoken to lots of customers, difficult CDC projects can take upwards of 6-12 months, sometimes longer. It’s just an inherently difficult project to get off the ground. And so that’s one of the areas where we thought, okay, we can apply our expertise and our automation and our technical skills to make that easier. The goal of the platform, the IP in the platform, is really doing the heavy lifting when it comes to setting up these connectors. And so CDC seemed like a natural place for us to focus, inherently because it’s very difficult for people to do, and if we can make it very easy, then there’s value in that. We also view CDC as the superset of data integration, in the sense that you can create the snapshot view of your data from the CDC stream, but you can’t really go the other way; you can’t create data where there isn’t any. You can compact the CDC stream into what the end result should be. And so if you’re starting from this richer, more granular stream of changes, then essentially any use case that is covered by the traditional ETL or ELT approach can also be supported by the CDC approach. But it also unlocks new things. A very contrived example, but I think one that explains where the value and the addition of the temporal data is: if you look at an e-commerce use case where you’re tracking what’s happening in shopping carts, then whenever someone adds something to the cart, you could potentially, and it’s a very naive approach, represent that shopping cart as a record in a database, and when someone adds something, you increment the number of items. That would actually trigger an event that would land in the stream. Whereas if you’re looking at just the snapshot, then whatever you happen to look at, that would be the number. And so if someone adds something, then removes it, adds two more, and then removes one, that’s all data that’s potentially valuable, and it would land in the CDC stream of what exactly the user did. Whereas if you’re just looking at the snapshot, it’s the end result. So if I added 10 things, then dropped it to only one, and that’s what I purchased, then later when the snapshot happens, you’d only see the one thing; you wouldn’t see the intermediate steps that I went through. So it’s a very contrived example, but I think it demonstrates the additional rich data that you’re potentially leaving on the table by not using the CDC stream.
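
As a worked version of that shopping-cart example, the sketch below replays a small, hypothetical change stream and shows that a snapshot only retains the final state, while the stream keeps every intermediate step.

    # Hypothetical cart events, for illustration only.
    cart_events = [
        {"op": "c", "after": {"cart_id": 7, "items": 1}},
        {"op": "u", "after": {"cart_id": 7, "items": 10}},  # added a lot of items
        {"op": "u", "after": {"cart_id": 7, "items": 1}},   # removed most of them
    ]

    # A snapshot is what you get by letting later events overwrite earlier ones.
    snapshot = {}
    for event in cart_events:
        row = event["after"]
        snapshot[row["cart_id"]] = row

    print(snapshot)          # {7: {'cart_id': 7, 'items': 1}} -- only the end result
    print(len(cart_events))  # 3 -- the stream still has every intermediate step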

Kostas Pardalis  09:25

Ali, I have a question. Usually when CDC comes into the conversation, it’s in two main use cases. One is, I think you also mentioned it, ELT and ETL. And the other one is as part of a microservices architecture, where you use CDC to feed data to different microservices. Do you see Meroxa so far being used more for one or the other?

Ali Hamidi  09:53

So, I mean, traditionally the approach that we’ve been pushing and kind of marketing for is the more traditional ELT use case, mainly because I think that’s easier to understand and it’s more common for people to wrap their minds around. But the structure and architecture of the Meroxa platform is that essentially, the way it works is when you create a source connector, you’re pulling data from some database, say a Postgres, through CDC, and it’s actually landing in an intermediate, specifically Kafka, that’s managed by the Meroxa platform.

Ali Hamidi  10:27

And so this is where, you know, the second use case, or the, I’m not sure if I want to use the term data mesh, because I feel like it’s pretty loaded and has a lot of baggage. But essentially, the application use case or microservices use case would sort of fall into place. Because these change events are actually landing in an intermediate that’s being managed by us, but that the customer also has access to. And so what we typically see is customers will come for the easier, low-hanging-fruit use case of ETL or ELT, but then almost immediately realize that, oh, actually, once I have this change stream in this intermediate that I can easily tap into, now I can leverage it for other things. And so we have some features that make that easier. An example of that is you can generate a gRPC endpoint that points directly into this change stream, and so you can have a gRPC client that receives those changes in real time. And that falls into the microservices use case pretty well. But it is the same infrastructure, and that’s kind of the key for us. We view Meroxa as being a core part of data infrastructure. And so we want to make it very easy for you to get your data out of wherever it is and place it into an intermediate, specifically Kafka, that you can then hang connectors off of, peek into, and really leverage that data for whatever use you have.
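
For a sense of what tapping into a change stream that lands in Kafka could look like from the consumer side, here is a minimal sketch using the kafka-python client. The broker address and topic name are placeholders, and Meroxa’s actual endpoints and authentication are not shown.

    import json
    from kafka import KafkaConsumer

    # Hypothetical CDC topic and broker; in practice these come from your platform.
    consumer = KafkaConsumer(
        "postgres.public.orders",
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
        auto_offset_reset="earliest",
    )

    for message in consumer:
        change = message.value
        # React to each change event, e.g. keep a search index or cache up to date.
        print(change.get("op"), change.get("after"))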

Kostas Pardalis  11:45

Yeah, yep, that’s super interesting. So, a follow-up question about ETL and ELT. CDC has, let’s say, a kind of limitation, which is not really a limitation of CDC itself; it’s more about the interface you have with the database. When you first establish a connection with the replication log, you only have access to a very limited window of past changes, right? You don’t actually have access to all the data, the whole state that the database holds. And usually, when you’re doing ETL, you first want to replicate this whole state and then keep updating it using a CDC kind of API. So how do you deal with that at Meroxa?

Ali Hamidi  12:29

Yeah, so for the tooling that we use, we like to say we stand on the shoulders of giants and leverage a lot of open source tools. The tooling depends on which data source you’re connecting to. Say if you’re using Postgres, we’re likely to provision the Debezium connector behind the scenes, and that actually supports the idea of creating a snapshot first. So it will basically take the entire current state and push that into the stream, and then once it’s caught up, it will start pushing the changes by consuming the replication log. And so you do get both: you get the full initial state as a snapshot, and then you get the changes once that initial snapshot is done. That’s kind of how we address that use case.
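
For reference, here is a sketch of registering a Debezium Postgres connector through the Kafka Connect REST API; snapshot.mode set to "initial" asks Debezium to copy the current table contents first and then stream from the replication log. Hostnames and credentials are placeholders, some property names differ between Debezium versions, and this is illustrative rather than Meroxa’s actual setup.

    import requests

    # Placeholder connection details; property names vary a bit across Debezium versions.
    connector = {
        "name": "shop-postgres-cdc",
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "plugin.name": "pgoutput",
            "database.hostname": "db.example.internal",
            "database.port": "5432",
            "database.user": "cdc_user",
            "database.password": "********",
            "database.dbname": "shop",
            "database.server.name": "shop",   # prefix used for the CDC topics
            "snapshot.mode": "initial",       # full snapshot first, then the change stream
        },
    }

    # Kafka Connect exposes a REST API for creating connectors.
    requests.post("http://connect.example.internal:8083/connectors", json=connector)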

Kostas Pardalis  13:13

Okay, that sounds interesting. So the initial snapshot is something that you capture through a JDBC connection?

Ali Hamidi  13:18

Yeah.

Kostas Pardalis  13:19

Okay. Okay. That’s clear. That’s interesting. Yeah, it makes total sense, because, you know, you need the initial state and then to keep updating that state. So yeah.

Ali Hamidi  13:30

And that’s supported across all of the data sources that we natively support. So whether it’s, you know, CDC through Mongo, or MySQL, or Postgres, they all work in a similar way. Do the initial snapshot, and then once that’s caught up, we actually start the CDC stream.

Kostas Pardalis  13:46

Super nice. So I know that your product became publicly available, like pretty recently, you were like in a closed beta for a while now, do you want to share with us a little bit about what to expect when I first sign up on the product? Some very cool features that you have included there? And I know I’m adding too many questions now. But also, like if there’s something coming up, like in the next couple of weeks, or something that you are very excited about?

Ali Hamidi  14:14

Sure. Yeah. So we launched and made the platform publicly available about a month ago, and it’s available at Meroxa.com. You can sign up, and we have a generous free tier. Our pricing model is based on events processed. And so you can go in and create an account, and you can leverage the dashboard or the CLI. And as I mentioned, really the IP is making it very easy to get data out of a data source and making it very easy to input data. An example of that is, I mentioned the CDC streams with Postgres, but the platform will gracefully degrade its mechanism for pulling data out of Postgres depending on where it’s being hosted, what version it’s running, what permissions the user has, and that kind of thing. The command, or the process, for the customer is uniform; it’s basically the same. And that also extends across the data sources. So you type the same command whether you’re talking to Postgres running on RDS with CDC enabled, or you’re talking to Mongo on MongoDB Atlas. It’s basically the same command, same UX. And that’s really our edge, I guess. In terms of features, really, what we’re pitching is the user experience. We’re trying to make it very, very easy to set up these pipelines and get data flowing, and that’s really where a lot of our attention has been focused.

Kostas Pardalis  15:35

That’s super interesting. And can you tell us a little bit more about the user experience? Like, how do we interact with it? I guess the product is intended mainly for engineers, right? So is the whole interaction through the UI only? Do you provide an API so that someone can programmatically create, destroy, and update CDC pipelines? What are your thoughts around that? What is, let’s say, in your mind, also as an engineer, the best possible experience an engineer can have with a product like this?

Ali Hamidi  16:10

Yeah, so from my perspective, you know, being on the user side of this sort of work for many years, I really felt most at home working through a CLI or some kind of infrastructure automation. I’d love to use something like Terraform or a similar tool to set up these pipelines. For us right now, we’ve launched with the CLI. So we have the Meroxa CLI, which has full parity with the dashboard, and we have the UI itself, which is the dashboard, so you can visually go in and create these pipelines. We haven’t quite yet made a public API available, but it’s something that we’re definitely interested in and working towards. We’re just not quite there yet. And certainly, you know, I’m a huge fan of Terraform, and the idea of infrastructure as code, I think, is great. It’s something that we definitely need to address, and that we’re looking forward to addressing in the future. But yeah, it’s the CLI right now and a dashboard through the UI, and there’s full parity between the two. Typically, the way you interact with it is you introduce resources to the platform. So you’d add Postgres, give it a name, add Redshift, give it a name, and then create a pipeline and create a connection to Postgres. The platform reaches out, inspects Postgres, figures out the best way to get data out, and starts pouring it into an intermediate Kafka. And then you kind of peek into that and say, okay, take that stream and write it into Redshift now. And the rest is handled by the platform.

Kostas Pardalis  17:40

That’s super interesting. By the way, I think we also have to mention that pretty recently you also raised another financing round. Is this correct?

Ali Hamidi  17:49

Yeah. Yeah, we raised a pretty sizable Series A. We closed towards the end of last year, but recently announced it, with Drive Capital leading our Series A. It’s been, you know, super amazing working with them and the rest of our investors. And yeah, that enabled us to accelerate the growth of the team, really build out our engineering team and the other supporting resources. So we went from about eight people last October to 27 as of today.

Kostas Pardalis  18:22

Oh, that’s great. That’s amazing. That’s a really nice growth rate. And I’m pretty sure you are still hiring. So yeah, any of our listeners out there who want to work on some pretty amazing technology and be part of an amazing team, I think they should reach out to you.

Ali Hamidi  18:38

For sure. For sure. We’re always hiring, always looking for back end engineers, front end engineers. Yeah. If you’re interested in the data space, then we’d love to hear from you.

Kostas Pardalis  18:46

That’s cool. All right. So let’s chat a little bit more about CDC and the use cases around it. Based on your experience so far, what are the most common, let’s say, sources and destinations for CDC? And why do you think people are mainly interested in it at this point, given the maturity the technology has right now?

Ali Hamidi  19:11

Yeah, so at least from our point of view, and what we’ve seen and what customers are telling us, the most common data sources would be Postgres, MySQL, MongoDB, SQL Server: really the operational databases, the things that are backing these common applications and APIs. That tends to be what people are asking us for. And I think the reasoning behind that is really that that’s where the most value comes out. So you mentioned earlier the two different paths for CDC use, one being ELT, and the other being the microservices, application-type use case. And I think there’s a really appealing aspect of saying, well, I don’t need to change any of my upstream application if all of the changes are happening in the database; I can just look into that stream, radiate that information across my infrastructure, and start taking advantage of it. And so I think that’s why most of the use cases, most of the requests, are really around operational data sources.

Kostas Pardalis  20:14

It’s interesting. Can you share a little bit of your experience with the different data sources? Which one do you think is the easiest to work with in terms of CDC, and which ones are the most difficult?

Ali Hamidi  20:25

Mainly because of my time at Heroku: Heroku was very famously, very strongly associated with Postgres. I’d argue that Heroku Postgres was probably the first solid, production-grade Postgres offering that was available as a managed Postgres. And I think Postgres as a product itself is incredible. I think it’s really great to work with, and its development has been super fast-paced but always very stable. And I think the way that they have implemented replication has made it very, very useful for building CDC on top of, so personally, that’s where I would lean. I think to get the premium CDC experience, Postgres is probably the best right now. I know that MongoDB has done a ton of work with their streaming API and done stuff there to make that super easy too. But yeah, just for simplicity and getting things up and running, Postgres is great for CDC, mainly because it leverages the built-in replication mechanism.

Ali Hamidi  21:27

That being said, one of the things that we continually see, and this is probably a good time to bring up the initiative that we’re trying to work on amongst some partners and industry peers: CDC itself has come a long way in terms of what it does, the interest in it, and where it can be applied. But I think there’s room for us to agree as a community, as a collection of experts that work in the field, on some guidelines to make interoperability better. You have different companies building out CDC mechanisms, whether it’s someone building CDC natively into their product, like CockroachDB, or someone like the Debezium team at Red Hat who are building these CDC connectors. I think there’s definitely an opportunity for us to sit around a table and agree that, all right, if I want to provide a great CDC experience and enable interoperability, so maybe I want to use Debezium on one end and pour that CDC stream into CockroachDB, let us agree on at least a style of communication, some kind of common ground between us, so that we can make this interoperability possible and make it easier for customers to really make use of it.

Ali Hamidi  22:45

And so, one of the things that we’ve been talking about, and I’ll let Fox talk a little bit more about the initiative in general, is we’re basically partnering up with some of our industry partners to push the idea of an open CDC initiative, essentially to agree on what it looks like to implement CDC and support CDC, and what it looks like to support it well.

Kostas Pardalis  23:09

Well, that’s super interesting. Yeah. I’d love to hear more about what’s the state of open CDC right now?

Taron Foxworth  23:16

Yeah, so I’d love to hop in here. This has been so informative. I’ve just been sitting here clapping my hands, soaking in knowledge and all that information about CDC. But open CDC is really, I think, an initiative that’s going to drive a lot of activity and community around CDC in general, because, like Ali mentioned, there are multiple ways you can actually start to capture this data. Debezium, for example, leverages Postgres logical replication to keep track of all the changes that are occurring, and the nice thing there is you get changes for every insert operation, update operation, and delete operation. But there are also other mechanisms of CDC. For example, one connection type is polling: you can constantly ask the database to look for a primary key increment, so when a new ID has come in, you know that’s a new entry, or you can look at a field like updated_at. So with all these different mechanisms of actually tracking the changes, we want some consistent format across systems: if you have a CDC event, you should be able to tell, here’s what snapshots look like, here’s what creates look like, here’s what updates look like, here’s what deletes look like. And what we can start to do is offer some consistency amongst these systems, so that CDC producers and CDC consumers all agree on what they should be producing and consuming. And that just leads to a great foundation for all the things that Ali was talking about, the secret sauces of CDC, whether that’s replicating data all the way to building microservices that leverage these events in an event-driven-architecture kind of way. So right now, in terms of open CDC, we’re putting together these standards and this specification, so be on the lookout for something more official soon. But if you have any ideas, we would love to hear from you and love to work with you on this initiative to make sure that this is something that’s really great for the CDC community.
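
To illustrate the polling mechanism Fox describes, here is a minimal sketch that repeatedly queries an updated_at column using psycopg2; the table, columns, and connection string are hypothetical. Note that this approach misses deletes and intermediate states, which is exactly the gap log-based CDC closes.

    import time
    import psycopg2

    conn = psycopg2.connect("dbname=shop user=cdc_user")  # placeholder DSN
    last_seen = "1970-01-01"

    while True:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT id, status, updated_at FROM orders "
                "WHERE updated_at > %s ORDER BY updated_at",
                (last_seen,),
            )
            for row_id, status, updated_at in cur.fetchall():
                print("changed:", row_id, status)
                last_seen = updated_at       # high-water mark for the next poll
        time.sleep(5)                        # polling interval: the floor on freshness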

Kostas Pardalis  25:20

Yeah, that’s amazing, guys. I hope this is going to work out in the end. And obviously, anyone who is listening to this and is involved one way or another in this kind of CDC project, I think they should reach out to you. Outside of Meroxa right now, are there any other partners that are part of this initiative?

Taron Foxworth  25:44

Yeah, one big one is Debezium itself. We talked with the lead maintainer of the Debezium project, because I think Debezium, just as a project in general, has been so influential in terms of CDC, and their format, that JSON specification, includes things like the schema that is being tracked from the database, the events and the operation, things like the transaction number of the database transaction, and, in the case of logical replication, the actual WAL position they would be reading from. So they have been one group that we’ve been working with, and Materialize is another. Materialize is a streaming database, and CDC is really important for them, because as soon as you’re streaming changes and calculating information, that system depends heavily on how it consumes the data and then produces it back out in a meaningful way. So when you look at CDC in general, you might have actual products, such as Postgres, producing a CDC stream, but you also have CDC services, say, like Meroxa, that are actually consuming them and letting you do something useful. So I think there are different types of players and companies that we can begin to work with, but those are a couple of the ones that we’ve been having some really awesome conversations with so far.

Kostas Pardalis  27:05

That’s super interesting. Fox, do you see value in having these conversations also with the cloud providers, for example? The reason I’m asking is because, so far, with the products I’ve seen that try to do ETL from, like, Postgres, MySQL, and so on, depending on the cloud provider and the version of the database, you might be able to perform CDC or not, right? So there is no unified experience, at least across the different cloud providers out there. Do you think it makes sense for them also to be part of this initiative?

Ali Hamidi  27:41

Yeah, I mean, I think it definitely makes sense. I know we want to try to get as many people on board as possible. And, you know, one of the ideas that we’ve been talking about is how we can classify this. I don’t want to say compliance, because I feel compliance is too strong; we don’t necessarily want to enforce a standard, but some kind of categorization of, like, good, better, best: if you are planning to leverage CDC, this is a really good experience, or this is the best possible experience, where you get all of the operations you want, it’s very clear, you get the before and after of the event, you get everything you need.

Ali Hamidi  28:18

So yeah, I think from my point of view, you know, the more people that are involved, the more people that adopt it, the more people that are following our guidelines, the better it will be, and the more likely we’ll have successful interoperability. And so I can definitely imagine a world where these bigger cloud providers are not necessarily changing their formats to match it, but at least, you know, if they’re going to build something new or integrate something, then why not build against some sort of commonly accepted guidelines that benefit everyone?

Kostas Pardalis  28:55

That’s great. I think you’re after something big here, guys, so I really wish you the best of luck with this, and also, from our side as RudderStack, I think it would be great to have a conversation about that and see how we can also help with this initiative. We should chat more about it. Alright, so some questions about CDC again, and the experience around using CDC. You are providing a solution, right? It runs in the cloud, Meroxa connects to the database system of your customer, this data ends up on a Kafka topic, and from there, from what I understand, it can be consumed as a stream using different APIs. What are, let’s say, the expectations in terms of latency that a user should have when using CDC in general and Meroxa in particular?

Ali Hamidi  29:56

So with CDC, it’s very much dependent on how it’s implemented, right? You know, I mentioned previously that one of the things we do is degrade gracefully in terms of what is possible. And so if you point Meroxa at a Postgres instance that’s running on RDS and has the right permissions and logical replication and everything, then latency is incredibly low, because it’s basically building on the same mechanism that is used for replication. And so if you had a standby database, typically that’s potentially less than a second behind, you know, milliseconds behind, in terms of replication. And so we’re seeing that data in real time, at the same time as all the other standbys are, so the latency can also be sub-second. But that’s the best case. I mentioned with open CDC, like, good, better, best; this would be the best tier, where you’re really getting low latency, high throughput, and low resource impact. But the end to end is obviously very variable, because once it’s in Kafka, Kafka is very famously high throughput and low latency as well, so that tends not to be the limiting factor. What tends to be the limiting factor is what you do with that data. If you’re tapping into the stream directly and using something like the gRPC endpoint, the feature that we have, then you could potentially also get it sub-second and see all of those changes that are happening on the database. If you move down to something different, like maybe you’re running Postgres that’s very restrictive, you’ve given us a user that has very limited permissions, and we aren’t able to plug into the logical replication slot, then we fall back to JDBC polling. And so then you’re looking at the worst-case scenario of the polling time, plus whatever time it takes for us to write it into Kafka. And potentially, if you’re writing it out to something else, like S3, or something that is inherently batch-based for writes, then you’re incurring that additional time penalty. But typically, what we see end to end is still pretty low; single-digit seconds is quite common.
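
A rough latency budget, using illustrative numbers only, makes the tiers described here easier to compare: log-based capture, the polling fallback, and a batch-oriented sink.

    # Back-of-the-envelope latency budget (illustrative numbers, not measurements).
    replication_read = 0.1      # seconds: reading from the logical replication slot
    kafka_write      = 0.05     # seconds: landing the event in Kafka
    polling_interval = 5.0      # seconds: worst case, a change just missed the last poll
    batch_flush      = 60.0     # seconds: a batch-based sink like S3 flushing periodically

    log_based  = replication_read + kafka_write   # ~0.15s, the "best" tier
    polled     = polling_interval + kafka_write   # ~5s, JDBC polling fallback
    batch_sink = polled + batch_flush             # ~65s, polling plus a batch sink

    print(f"log-based ~{log_based:.2f}s, polling ~{polled:.2f}s, batch sink ~{batch_sink:.0f}s")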

Kostas Pardalis  32:04

That’s interesting. Do you see practical workloads where you also have to take the initial snapshot? Do you see issues there in terms of catching up with the data as it is generated, using CDC?

Ali Hamidi  32:17

Yeah, that’s an area where I think there’s definitely room for improvement, both in the way we handle things and in the tools in general. The initial snapshots can often be very, very large. Obviously, if you use something like Meroxa right at the beginning, it’s great, because you don’t have that much data. But if you come in pretty late in the game and you have terabytes of data, then that’s terabytes we have to pull in before we can start doing the CDC stream. And so I think there’s room for improvement in terms of the tooling: being able to do it in parallel, or things like that, would be great. And I know we’re working on things internally, and the upstream providers, like the Debezium team and others, are also working on things like allowing incremental snapshots and being able to take snapshots on demand. So I think there’s definitely room for improvement. You know, I’d love for us to be able to seed a snapshot, maybe preemptively load from historical data and then build on top of it, rather than only take the snapshot ourselves, and stuff like that. So yeah, I think there’s still definitely room for improvement there.

Kostas Pardalis  33:30

Yeah, that’s super interesting. One more question. CDC is traditionally considered something related to database systems, transactional database systems like MongoDB, Postgres, et cetera. Do you see CDC becoming something more generic, let’s say, as a pattern, and including other types of sources too?

Ali Hamidi  34:03

Yeah. I think, you know, if you squint your eyes a little bit, a CDC event is just an event that describes something that happened to the database, right? And so it’s really no different from evented systems: if you were building out an application, you might emit an event from your application that describes a state change. So really, it’s the equivalent in functionality, or in semantics: here is a state change that your database experienced, versus here is a state change that your application is experiencing. And so our goal, or our belief, is that we should provide a uniform experience across the two of them. It may not necessarily be called CDC, because, you know, evented systems as a term has been around for a while, but there’s no reason that events coming from any kind of SaaS application, or your own custom application that’s triggering these events, shouldn’t be treated uniformly with the CDC events, if you just consider them state changes of some sort.

Kostas Pardalis  35:08

Yeah, absolutely. I think the first example that comes to mind related to that is Salesforce. Salesforce lets you subscribe to changes; actually, they call it CDC, to be honest. I don’t know how well it works, but it’s a very good example of CDC as an interface with a SaaS application, right? So yeah, I’d love to see more of this happening out there. I think that as platforms embrace this kind of way to subscribe to changes and catch up, things will become much, much better in terms of integrating tools. So yeah, that’s interesting.

Kostas Pardalis  35:47

Ali, something else about that. Recently, there’s been a lot of hype around what is called reverse ETL. So there we have the case of actually pulling data out of the data warehouse and pushing this data into different applications in the cloud. Traditional data warehouses are not built in a way that, you know, emits changes, or even allows for many concurrent queries; it’s a completely different type of technology. Regardless of that, though, we see that in examples like Snowflake, right? Snowflake, from what I’ve seen recently, has a way where you can track changes. It’s not exactly CDC, but it’s close to CDC, right? Do you see CDC potentially playing a role in these kinds of applications too?

Ali Hamidi  36:39

I don’t know. I think the jury’s still out on reverse ETL. My initial reaction to the whole idea of reverse ETL is that it’s kind of a fix for potentially the wrong problem, I think. The reason people want reverse ETL is because you’re following this ELT idea of dump everything raw into your data warehouse, clean it up, process it, put it in a state that is useful for my other applications, and then now I want to take the data out and plug it into my other components. But I feel like that’s kind of too far downstream for us. My thinking on the subject is really, if ETL in real time was good enough, if we provided the right kind of tooling, the right kind of APIs, the right kind of interface to do that kind of transformation in real time on the platform, in a way that is manageable and sustainable, then it removes the need for dumping everything raw into a data warehouse, doing the processing, and then doing the reverse ETL. An example of this is, because we’re putting everything in Kafka, and Kafka has retention, we could plug in a connector and say, okay, take the last two weeks’ worth of data, apply this processing, summarize it in this way, do the stream processing, and then take those results and write them into my application. But it also lets you do things like, well, maybe the transformation was wrong; let me rewind and try again with a different transformation. And so I think the task for us is really to build that tooling and make the idea of reverse ETL almost unnecessary, by building better tooling. I feel like ELT and reverse ETL are really a result of having funky ETL tools, or tools that didn’t meet the needs, or weren’t really usable enough or performant enough to achieve that. So we’ve gone extreme in the other direction of saying, just get everything into your data warehouse, and then we’ll figure it out later. And that’s inherently not real time, and our focus is very much on real time. So if we can provide the right tooling and do it up front and do it on the platform, I think it should, hopefully, if we do it well enough, negate the need for reverse ETL.
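
The rewind-and-reprocess idea can be sketched with a plain Kafka consumer that seeks back to a timestamp and re-applies a transformation over the retained change stream; the topic, brokers, and the transform itself are placeholders, not Meroxa’s actual mechanism.

    import json
    import time
    from kafka import KafkaConsumer, TopicPartition

    consumer = KafkaConsumer(
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    tp = TopicPartition("postgres.public.orders", 0)   # hypothetical CDC topic
    consumer.assign([tp])

    # Rewind to the offset closest to two weeks ago (assumes the topic retains that much).
    two_weeks_ago_ms = int((time.time() - 14 * 24 * 3600) * 1000)
    offsets = consumer.offsets_for_times({tp: two_weeks_ago_ms})
    if offsets[tp] is not None:
        consumer.seek(tp, offsets[tp].offset)

    # Re-apply a (hypothetical) transformation over the window: count changes per customer.
    changes_per_customer = {}
    for message in consumer:   # runs continuously, like any stream processor
        row = message.value.get("after") or {}
        key = row.get("customer_id")
        changes_per_customer[key] = changes_per_customer.get(key, 0) + 1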

Kostas Pardalis  39:06

That’s a very interesting perspective. What do you think, Eric, about this?

Eric Dodds  39:10

Well, I was actually going to ask a question. I was going to follow up with a question, so I’m so glad you asked, Kostas. The reverse ETL topic is one that we love to talk about and debate on the show. But I think first it would be interesting, both for me and our audience, just to hear: what are the most common parts of the stack that are replaced with Meroxa when someone adopts the product? Or is it generally a net-new pipeline? I think it’d be interesting to know about that, and then I can give my thoughts on reverse ETL.

Ali Hamidi  39:45

Yeah, so we, I mean, we don’t necessarily try to go in and replace like, you don’t need to replace anything. Typically, the path for using Meroxa is to deploy us in parallel. And so we’ll tap directly into your operational database and start streaming the data into our intermediate Kafka. And then you can start leveraging Kafka, and the streams and the streaming data in Kafka to build out new applications or, you know, pour it into your data warehouse or whatever it is that you want. And so, you know, we use the term data infrastructure, and try to position it more as you know, we don’t view Meroxa as a point product; it’s not a point to point connection, really. What we’re trying to do is get your data out from the various data sources that it resides in and putting it into a flexible real time intermediate that you can then sort of tap into and leverage for other things.

Eric Dodds  40:40

Yeah, absolutely. Makes sense. And I think the reverse ETL question is interesting, because it crosses a number of different types of technologies that are connected into the stack. So the first thing that came to my mind, Kostas, when this subject came up, was that the tip of the spear tends to be marketing and sales SaaS tooling, right? When you think about data pipelines, whether it’s your traditional ETL, cloud data pipelines, or event streaming type tooling, the demands of marketing and sales to get data that’s going to help them drive leads and revenue tend to create a huge amount of demand. And so the first round of reverse ETL tools, I think, is really focused on those, right? You’re trying to get audiences out of your warehouse into marketing platforms and ad platforms, or enrich data from your warehouse into your sales tools, so your salespeople have better insight. But Ali, this has been such an interesting conversation, because the idea around streaming data in and out is much, much larger than just those point solutions. And so I think it’ll be fascinating to see how the space evolves, especially as technologies like Meroxa become more and more common and we discover all the different use cases. It strikes me as one of those tools, even throughout this conversation, where you get an immediate use case, and then you think about all of these other interesting ways that it could be useful as well, right? Which is so interesting.

Ali Hamidi  42:20

Yeah, for sure. That’s something that we see pretty frequently. Customers will come with a particular use case in mind, like, the most common one is operational data into your data warehouse. But once they have that data flowing, they have this real-time stream of events coming from their operational database that includes every change their database has seen. Then they almost immediately go, well, now that I have this, I can do these other things. Like, maybe I’ll tap into the same stream, transform it, and keep my Elasticsearch cluster up to date in real time while I’m at it. And once you do that, then you’re like, oh, well, actually, I can use this to, you know, fire a webhook that hits my partner company whenever this particular thing happens, because now I don’t need to change my infrastructure, I don’t need to custom instrument anything; I’m just looking at the stream of events, and I can tap into it and really leverage that. Yeah, so one of the things that you mentioned, the reverse ETL idea of enriching data for use in marketing: we don’t currently have this functionality, but I just want to get the thoughts of the audience and you. Imagine you jumped forward some amount of time, and we, or someone like us, can make it super easy to do cross-stream joins and enrich data in real time. Then do you really need to pour your data into a data warehouse, and then pull data from Salesforce, and then pull data from Zendesk, and then join them across all of the tables, and then write them out into something else? Whereas if you were able to do it in real time, by doing stream joins and hitting third-party APIs to enrich those records and create a flat record that you can then plug straight into Salesforce? I think it would be hard to argue against that; I can’t imagine anyone saying, you know what, this real time is just way too fast, I wish it was taking several hours, this is just too responsive. So I feel like it’s not a question of whether anyone would want that. I think that’s clear. It’s whether or not anyone, us or someone else, can make it happen in a way that’s easy to use. I think that’s really the task.
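
As a toy sketch of the cross-stream join idea: hold the latest record per key from each stream and emit an enriched record whenever both sides are present. The stream names and fields are hypothetical; a real implementation would sit on Kafka topics with windowing and state stores.

    orders_by_user, profiles_by_user = {}, {}

    def on_order(event):
        orders_by_user[event["user_id"]] = event
        maybe_emit(event["user_id"])

    def on_profile(event):
        profiles_by_user[event["user_id"]] = event
        maybe_emit(event["user_id"])

    def maybe_emit(user_id):
        # Join the two streams on user_id and push the enriched, flat record downstream.
        if user_id in orders_by_user and user_id in profiles_by_user:
            enriched = {**profiles_by_user[user_id], **orders_by_user[user_id]}
            print("push to downstream tool:", enriched)

    on_profile({"user_id": 1, "plan": "pro"})
    on_order({"user_id": 1, "total": 49.0})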

Eric Dodds  44:25

Yeah, you know, it’s interesting. In our last episode, or the one before, we mentioned these different phases, right? You have the introduction of the cloud data warehouse, which spawned this entire crop of pipeline tools, because now, all of a sudden, you needed to unify your data. And now you have the next round of that, where you’re seeing reverse ETL and different event streaming type solutions. And it’s interesting, because a lot of the new technologies spawn new use cases and spawn new technologies. And so I think it is fascinating to think about a future, and this is actually something we’ve been discussing a lot at RudderStack, where Kostas and I work, where currently we live in a phase with heavy implementation of pipelines. And if you imagine a world, which you talked about, Ali, where the use case is the first-class citizen and the pipelines are an abstraction that you don’t even really deal with in terms of setting up a point-to-point connection, I think that’s where things are going. And I think the type of cross-stream joins you’re talking about are fascinating, because then you get rid of all of this manual work to create point-to-point connections. It’s still very powerful to do all of that in a warehouse, but if you can abstract all of that and just give someone a very easy way to activate a use case, and not have to worry about the pipelines because all of that’s happening under the hood, I mean, that opens up so many possibilities, because you get so much time and effort back.

Ali Hamidi  46:15

I mean, for sure, you know, you hit the nail on the head there. That’s really the use case that we’re trying to address. That’s the problem that we’re trying to solve, you know, and that’s the world that we’re trying to head towards.

Eric Dodds  46:26

Very cool. Well, unfortunately, wow. We are actually over time a little bit. This has been such a good conversation. Well, thank you so much for joining us on the show. Audience, please feel free to check out Meroxa at Meroxa.com. And we’ll check in with you maybe in another six or eight months and see where things are at. Thanks again.

Ali Hamidi  46:45

That sounds good. Thank you so much for having us.

Eric Dodds  46:47

Well, Meroxa is just a cool company, and now having talked to three people there, they just seem like they attract really great people and great talent. So that’s always a fun conversation. I’m going to follow up on their answer to my initial question, which I thought was really interesting. With some technologies, you know, let’s say you change data warehouses or change some sort of major pipeline infrastructure in your company, that can be a pretty significant lift. And it was really cool to me the way that they talked about how their customers are approaching implementing CDC: it really was around, if you need to make some sort of change or update to some sort of data feed, then you can replace that with Meroxa. And that’s what they see a lot of companies doing. And I think that makes CDC a lot more accessible as a core piece of the stack, as opposed to going through some sort of major migration. What stuck out to you, Kostas?

Kostas Pardalis  47:44

Yeah, two things, actually. One is about this great initiative they have started, which is open CDC. I’m very interested to see what’s going to come out of it. Just to remind our listeners, it’s an initiative that will help standardize the way CDC works, mainly around the messages and how the data is represented, so it will be much easier to use different CDC tools together. Anything that is open is always a step forward for the industry; it remains to be seen how the industry and the market are going to perceive it. So that was a very interesting part of our conversation. The second one was about reverse ETL, and the comment that Ali made that, actually, if you implement CDC and ETL in general in the right way, you don’t really need reverse ETL. It’s very interesting, a little bit of a controversial opinion, considering how hot reverse ETL is right now. So again, I’m really curious to see in the future who’s going to be right. It was a very exciting conversation, and I’m looking forward to chatting with them again in a couple of months.

Eric Dodds  48:55

Sounds great. Well, thanks again for joining us on The Data Stack Show, and we’ll catch you next time.

Eric Dodds  49:02

We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at Eric@datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.

38: Graph Databases & Data Governance with David Allen of Neo4j

In this week’s episode of The Data Stack Show, Eric and Kostas talk with David Allen, a partner solution architect at Neo4j. Together they discuss writing technical books, integrating something like Neo4j with an existing data stack, and many different use cases for graph databases.


Blog Posts

March 2021

March was an exciting month with inspiring guests talking about the current state of data and predicting how exciting the future of data will be.


February 2021

The February playlist includes interesting episodes with Intuit, Tecton, Policygenius Inc., and an interesting chat with Duc Huba, an AI researcher and enterprise mobility solution


January 2021

Our January playlist included interesting episodes with Immuta, Homesnap, and Iteratively, where Eric and Kostas discussed the following topics: Enabling fast, efficient, and understandable data


December 2020

Happy New Year! Our December playlist included a follow-up episode with Earnnest and interesting talks with data leads from Netflix and Bind. Our duo discussed


November 2020

RudderStack Head of Customer Success, Eric Dodds, and Head of Product, Kostas Pardalis are excited about the launch of The Data Stack Show. It’s their
