This week on The Data Stack Show, Eric and Kostas chat with Boris Jabes, the CEO at Census. During the episode, Boris discusses what ETL is, where it came from, and how to work with it within different companies.
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Eric Dodds 0:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.
Welcome to The Data Stack Show. Today, we're going to talk with Boris from Census, which is categorized as a reverse ETL tool. I have a sneaking suspicion that Kostas is going to ask about the reverse ETL terminology. What I'm going to ask about is this: what's interesting about Census, which takes data from the warehouse and pushes it out to other tools in the stack, is that it assumes some value has been created in the warehouse beyond just the raw data that was loaded there. So I want to know what Boris is seeing: how does that impact the way he thinks about customers and the product they're building, and what are the ways companies are trying to create that value? I mean, dbt is obviously sort of a new way. But I'm really interested in that. How about you, Kostas?
Kostas Pardalis 1:14
Well, first of all, I have to figure out who came up with the term “reverse ETL.”
Eric Dodds 1:18
Yes, the etymology of tech terms is such a taste.
Kostas Pardalis 1:23
Yeah, I mean, it's probably more of a marketing term, to be honest. But I have a suspicion here too: Census was probably the first company in this space, so the term probably has to do with them. So I want to learn what the story behind it is. Outside of this, I want to ask Boris and try to understand the difference between getting data from, for example, Marketo and pushing it into the data warehouse, and doing the inverse, which is taking data from the data warehouse and pushing it back to Marketo. What are the different challenges there? Why are they different? Why do we need different tools? Is the user the same? Why do we have different product categories? That's what I want to understand, and I hope he's the right person to have this conversation with.
Eric Dodds 2:18
Well, let’s go find out.
Kostas Pardalis 2:19
Let’s do it.
Eric Dodds 2:21
Boris, welcome to The Data Stack Show.
Boris Jabes 2:23
Hey, nice to be here.
Eric Dodds 2:25
All right, give us the brief background on where you came from and what you do today at Census.
Boris Jabes 2:31
Where I came from? Originally from Canada, if that's the real meat of the question. It's mainly a geographic question. I'm a Canadian who lives in San Francisco, through a variety of stops along the way. My career started at Microsoft. I have always been a tool builder, so I started my career on what I consider kind of the ultimate tool, which is Visual Studio, the tool that tool builders use to make software. It's a particularly interesting challenge to start your career on. I spent quite a few years working on developer tools. Then about a decade ago, I started my first company, which was in the field of what you'd call identity management and single sign-on, for the people who know these things. After I sold that company, my brain stayed tuned to what I call data silo problems, data federation problems, and very quickly re-centered on the problem that we started to solve in 2018 with Census, which was to get data from product and analytics teams out into the rest of the business. We were just frustrated by the lack of bridging between those two worlds. So that's how our company was born. Today, I'm the CEO of Census. We're mostly based in San Francisco in the US, with kind of a 50/50 mix at this point of remote and in San Francisco, and, you know, kind of humming along.
Eric Dodds 4:06
Yeah, one quick question on the way that you notice problems around data silos and other things. Was that both in your company and with your customers? Or was it primarily something you learned building the company yourself?
Boris Jabes 4:25
I guess I see it everywhere. Once you see it, you can't unsee it. I think great startups and great founders don't look at the world, let's call it, from the MBA perspective: "Ah, there's a market opportunity there." They just want to build something, right? And I don't knock identifying market opportunities. But I find that you tend to get obsessive about trying to solve a problem, either one that you've experienced or one that you see and can't unsee. In my case, it was both. When I see software as a service, and I see people using all the amazing apps (some of our customers have something like 300 apps in their organization, if you can imagine that), I think that's wonderful, right? That means that lots of people get to use the tools they want, people can be productive, people have best-of-breed user interfaces and all that stuff. But invariably, and maybe other people don't see it as immediately as I do, but I just can't not see it, data is replicated ad hoc across all of these applications. And what is the data? Well, it's the same kinds of things over and over again. That feels wrong to me, and I feel like we need to help solve that problem. There are all sorts of tools that have existed over the decades trying to solve what people called data integration. It's not a new concept. The unique perspective we brought to it when we started the company in 2018 was that there was this treasure trove of data in data warehouses, with product analytics teams and product teams, that everyone on the product and engineering side used. We were all very comfortable using those things, whether from your operator console or from your Amplitude analytics tool, whatever, right? We were all living and breathing it. And sales and marketing and success and support teams were not.
And so we built this bridge, right? It went from the data warehouse out towards the business tools. And in 2018, that was a weird and novel thing, so people didn't even know what to call it.
Kostas Pardalis 6:31
So how did we come up with the term “reverse ETL” and who came up with this term?
Boris Jabes 6:37
Yeah. So when we first started, this was approximately August or September 2018, which is when we were building the first version of Census and talking to our first customers. Our first two or three customers, basically on their own, decided to describe our product as "reverse Fivetran." It's really specific, but in a way, that's great. And so they did that, right? You'll meet a lot of first-time founders for whom "how do you describe your product" is a classic conundrum, and people make it too complicated with buzzwords. So we were just like, "Connect your data warehouse." We were keeping it real simple, and we weren't trying to complicate it with buzzwords. And then they were like, "Oh, so it's kind of like Fivetran in reverse." And we were like, "Yes, if that works for you, that's great. Let's go with that." Now, of course, we didn't put that on our website; that would have seemed really weird. But in colloquial speech, that's how people were reasoning about our software. And obviously, you're not going to launch your company that way. So in our first year, back in 2018 and 2019, we were just going around finding our first customers, getting them on the product, and riffing on all sorts of things we could call this. And funnily enough, around June, July, or August of 2019, one of our customers was actually working in tandem with folks at Fishtown Analytics, which is now dbt Labs. For the folks who might not know, because now it feels like ancient history, the company that builds dbt was originally selling consulting services rather than selling the software. So one of our customers was consulting with them, and they were paying for our software.
And they were developing a really cool data stack, and they were working with one of the folks at Fishtown. We knew the company existed, but we didn't know anyone there at the time. She was one of their first consultants, and she became kind of the community manager. Her name is Claire Carroll. She started taking notes on the things she was seeing out there when she was working with customers. And so, sometime in the summer of 2019, out came this Notion doc that was linked somewhere on the internet, and which has long since disappeared, in which she was taking notes. It was literally a page of just notes. In it, there was this line going, "And then there's this Census thing, like reverse ETL." From her perspective, instead of calling it reverse Fivetran, it made sense to just say, "Oh, it's like reverse ETL." So that's the first evidence of that term ever showing up in writing, to my knowledge. The reason we weren't using that term at the time was that I have the unfortunate problem of being too knowledgeable, or too nerdy, or too mathematically oriented: the word is technically a misnomer. ETL has no direction. At least "Fivetran in reverse" was a reasonable descriptor, right? "Reverse ETL" seemed like a mathematically incorrect way of describing the thing, but at least it's a generic term. So we bandied it around for a while for fun in 2019. Then we launched the product and the company in 2020, and it very quickly became the de facto name for this. And far be it from me to argue with the public, right? It doesn't seem like a worthwhile way to spend my time.
So that's my personal recollection of the birth of that term. And then, when we did our Series A announcement, which was in February of 2021 (these last couple of years are all blending together), the VC ecosystem's landscape machinery kicked into high gear. In the same way that engineers like to think in terms of data stacks, venture capitalists like to think in terms of landscapes. Everyone famously knows the marketing landscape, and now the data landscape is just as complicated. This is the kind of output they like to produce. It's like a success for them: "I've managed to put every logo I've ever heard of into a single chart with squares around them." So that's when reverse ETL really became a household concept, once it started showing up in those.
Eric Dodds 11:29
That is some high-quality lore. The detail of the Notion doc is perfect. So thank you for that bit of history. Okay, my follow-up question to Kostas's question about where the term came from is this. I agree that, mathematically, it's not technically accurate. But even beyond that, my bigger question is that in some ways it's very singular, right? It's like a line on a chart that, you know, we in the data industry create, or an investor creates. But you're building tooling in this space. Do you think that's a sufficient term to describe what you envision you're building, or the problem you're solving?
Boris Jabes 12:20
Hmm, no. I mean, you're giving me really too much rope there to say whatever I want, but that's the point of the show. Before this call, Kostas described himself as a plumber, since he'd worked on pipelines for so long. And I think there's great pride to be taken in building excellent data pipelines. It's something that we pride ourselves on, and I'm sure you do as well, and our customers do. But it's not what I think the product is actually about. It's not what excites our users, right? When I think of great software, especially tools (I mean, there's software of all kinds, right?), you're basically trying to make someone else, your user, a more awesome version of themselves. That's just the best way to think about it. And our users are not trying to become really good data pipeline people. That's not their goal. When we started the company, I was not thinking, "You know what I'd love to do? Just spend my life building great data pipelines." That's not the core animus. It's absolutely an essential means to reach our end. But what I wanted to solve, and what I get to see with our users every day, is bridging the gap between what I called the analytics and product organizations and the go-to-market organizations. I was very frustrated that that gap existed. There were a lot of tools out there that had taken stabs at this, right? Famously, there were tools like Segment that connected the code you wrote in your app directly to your marketing tools. This was a huge step forward. But I kept seeing this problem: the data organizations that were emerging, the BI organizations that were emerging, were disconnected from the rest of go-to-market, right? Finance, support, sales, just the whole world of the company. And so building that connection was important to me.
And you don't just have to build data pipelines to make that work, right? You have to change the relationship between those teams and the data organization. If you ask data teams all over the world what their day-to-day life is like, they will tell you that they're really crumbling under load, like the support load of getting data requests and having to build yet another dashboard. They're very overworked, like IT teams, right? What I felt they needed to move towards, and what I think Census's underlying goal should be for them, is not to make pipelines that run faster than the pipelines they could build. That's a good-to-have, and I'm glad that our pipelines are superior to the ones you would build yourself. It's actually to turn your data organization into (we use the term a lot nowadays, but we really meant it from the beginning) a kind of product or platform team, because it's the only way to serve your whole company at scale. Otherwise, you're just the hated service org, right? You're the IT team that no one really likes, because everyone's always stuck behind 32 requests. That was a huge part of what Census has always been about and continues to be about. So, see, it's not really about the plumbing. It's about saying, "How do I turn the data team into the most essential part of your whole company, the one that everyone else depends on?" You may have caught me saying this earlier, but I think of Census a lot more as a data federation tool than a data pipeline tool. That's why it's called Census: my goal is to say that at a company there should only be one version of the truth, there should only be one census of your users, your data, et cetera. Everything else in the company should naturally be kind of a cache on that data, pulling from that information as seamlessly as possible. And that's what Census does.
Kostas Pardalis 16:22
Boris, can you elaborate a little bit more on how this reverse Fivetran, reverse ETL, whatever we want to call it, is actually different? What are the challenges that Fivetran does not have when pulling the data from the apps or the database and putting it into the data warehouse?
Boris Jabes 16:43
Totally, totally. That's a great question. From the outside of almost any company, any software, any tool, people always think, "How complicated can it be? It's reverse Fivetran." As soon as you distill things into two words, you somehow lose all the underlying complexity. So there are a couple of really significant ways in which this is different and difficult in its own right for people to build. The first is that when you're pulling data from SaaS applications into your warehouse, you're actually dealing with very consistent source data, right? If you go to all the various ELT tools, they'll show you the ERD for all these applications, and they're fairly stable. What you're doing is saying, "Let me get Salesforce, pull the schema, and dump it into the warehouse." And warehouses, to their credit, are very easy places to say, "Here's a table, just dump it." I'm not trivializing the work of building great pipelines there. But you're basically going from a raw data structure that is not changing super often, with read APIs from those products, which are generally the first API that any SaaS product will build, down into a data warehouse, of which there is a low n, right? There are only so many data warehouses, and they're fairly consistent at being able to write a raw table in. And then all the little details, of course, emerge in trying to get that just right, making it incremental, et cetera. When you're thinking about this in reverse, the first thing is that everyone's data models are different. You're at the end of the data refinery. So it's not the raw data from Salesforce, which is always the same schema. It's whatever entities your company has evolved, right? It's what your data organization thinks is essential about your users and your workspaces.
Maybe you have a many-to-many model of your user base, or maybe you don't, right? Maybe it's one-to-one, or there are no organizations and it's all just B2C. All these various patterns are bespoke to your company. And that's where Census starts, right? It has to first take your distilled version of the data, at the end of all your pipeline transformations, and say, "Okay, we'll work with this." And then we have to write into applications. There are two problems there. One is writing data: the APIs are terrible, because most SaaS applications focus first and foremost on easy read APIs, and the write APIs are very heterogeneous and generally very poorly designed. And then, if you screw that up, the damage is really, really high. I think that is the most important aspect of this. When you think about a product like ours, even if you were to do this yourself, say you're an engineer at your company and you're going to build these things, you will generally be reluctant to do a lot, because your upside is "I got the pipeline done," and who gets promoted for that? And the downside is very significant, because you're going to accidentally put a million things into Marketo that you weren't supposed to put in, and no one knows how to delete those things. Guess what: deleting is hard in SaaS. So now your marketing team is angry, and you've sent emails to customers that are wrong. The downsides are very high. I think that's actually what generally held back this side of the company. The product and analytics world was evolving very well, because it's agile. But on this side, it's like one project a year, one project a quarter, right? And so that's really what we were trying to change here.
So what do you have to do? You have to validate data more deeply, and you need much more fine-grained ways of writing data. So we have all sorts of different capabilities. You can use Census to say, "Hey, I only want to update what's there; I don't want you to create new stuff." Or, "I want you to write into Salesforce, but I don't want you to overwrite this field if it's already set." Because, again, there's much more subtle stuff going on when you're in these operational workflows: there's an email that's going to go out automatically on edits, or a salesperson is going to make a phone call an hour later based on what's happening in there. So we have a lot more subtle capabilities to ensure that you're not breaking your operational world. And so one way to reframe what Census is, as opposed to pipelines, is as a kind of continuous deployment tool for data, with all of the needs that come with that.
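The write behaviors Boris describes here (update-only syncs, and leaving a field alone if it already has a value) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not Census's API: `plan_writes`, the `mode` values, and `protect_if_set` are hypothetical names, and the destination is a plain dictionary standing in for a SaaS tool such as Salesforce.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]


def plan_writes(
    source_rows: List[Record],
    destination: Dict[str, Record],
    mode: str = "upsert",                     # hypothetical: "upsert" or "update_only"
    protect_if_set: Tuple[str, ...] = (),     # hypothetical: fields never overwritten once set
) -> Tuple[List[Record], List[Tuple[str, Record]]]:
    """Decide what to create and update without touching the destination."""
    creates: List[Record] = []
    updates: List[Tuple[str, Record]] = []
    for row in source_rows:
        key = str(row["id"])
        current = destination.get(key)
        if current is None:
            # "I only want to update what's there": skip unknown records.
            if mode != "update_only":
                creates.append(dict(row))
            continue
        changes: Record = {}
        for field, value in row.items():
            if field == "id":
                continue
            # "Don't overwrite this field if it's already set."
            if field in protect_if_set and current.get(field) not in (None, ""):
                continue
            if current.get(field) != value:
                changes[field] = value
        if changes:
            updates.append((key, changes))
    return creates, updates


# Example data: one account already exists in the destination, one does not.
destination = {"a1": {"id": "a1", "owner": "dana", "plan": "free"}}
warehouse_rows = [
    {"id": "a1", "owner": "sam", "plan": "pro"},
    {"id": "b2", "owner": "lee", "plan": "free"},
]
creates, updates = plan_writes(
    warehouse_rows, destination, mode="update_only", protect_if_set=("owner",)
)
```

Planning the writes separately from executing them is one way to make the "validate more deeply" step reviewable before anything reaches an operational tool; a real sync would also have to handle API errors, rate limits, and deletes.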
Kostas Pardalis 21:20
Yeah, 100%. And actually, I want to extra emphasize what you are saying about the difference between reading from and writing to a source application. Something that I want to add, and make sure all of our audience is aware of, is that, by the way, Claire did something very right: she named it ETL and not ELT. That's very, very important. The fact that we can do ELT, which means we extract whatever we can and just load it and dump it there, and then have models that we version in dbt or whatever, so that we can go back and fix problems if we have them, is huge. And we don't realize that. If you go to an ETL engineer who was working on, I don't know, BI systems 30 years ago, they had the same problems that you have, because everything was so costly. A transformation is something that can destroy something, especially if you do it on the fly, right? So, exactly as you said, mathematically it is the same thing, but in terms of the engineering that you need to put in there, it's completely different.
Boris Jabes 22:35
Yep. I think a lot about product as an experience as well. If you think of the user who is trying to pull data into a warehouse, that ELT scenario that we've all been very familiar with for the last decade, and you think about what they're trying to accomplish, for almost all of them it's in the name, right? It's analysis. They're trying to pull data in so they can do some kind of analysis. How much money did we make? How much money could we make? It usually comes back to one of those two things. There are lots of kinds of analysis, but it's analysis. Whereas for our user, analysis is not the goal. The goal is operations, right? It's automating something. It's, "Hey, I want to send emails with a promotion about a shoe that you should buy, but tied to the specific segment of users who are likely not to retain if we don't send them the shoe," et cetera. So you're trying to get fine-grained detail into your email system, not to produce a spreadsheet, but so that an email goes out, or a sales call goes out, or a better support experience comes out. That is a very different user need. So I think when a person wakes up in the morning and opens up our tool versus an ELT product, what they're thinking about is different. I think they're actually just trying to solve different problems.
Kostas Pardalis 24:01
Quick question before Eric asks. Are the users different between the Fivetran user and the Census user?
Boris Jabes 24:12
Yeah. I mean, I'm sure you see the same thing I do, in that data teams range dramatically in size. I admire the crap out of a lot of our users who are, you know, data teams of three things in one body, so to speak. They pull the data, they model the data, they push the data; they do all of it on their own. But I think when a data team grows, it actually ends up being different people. You could think of it as almost like the concept of, do you remember the forward deployed engineer? Was it Palantir that first started using that term? I think data teams now have all sorts of roles, right? There are the core platform-building people. There are ML people, you know, just sitting there doing really cool analyses that hopefully are worth money, I don't know. And then there's this kind of forward deployed analyst, let's call it, whose job is not just to sit there and pontificate on what revenue is, but actually to go help the marketing team, the sales team, the support team improve the operational excellence of the company. So yeah, I think that person might, on a different week, be doing something related to Fivetran and analysis. But day to day, at scale on your data team, I think these are actually different sets of people.
Kostas Pardalis 25:40
Eric, all yours.
Eric Dodds 25:41
You saw me chomping at the bit. Boris, I'm interested in what I'll call maybe the chicken-and-egg problem a little bit. I'll lead in with this: I was thinking the other day that Google Analytics is still so pervasive, but relative to what's available now, it's so primitive in many ways. And I was thinking about it a little more: okay, part of the reason is that you have packaged collection and visualization, and disaggregating those things creates really big challenges on both sides, right? So people just kind of go to it. Then you think about Fivetran, and it's like, okay, I'm taking data with largely known schemas and dumping it into a place that can ingest data, whatever the schema is. Great. But when you think about the practical side, "I want to send emails," or "I want a salesperson to prioritize something," there's an assumption, I think, that some sort of value has been created beyond the initial dump into the warehouse. And I'm interested to know how you approach that, because every business's data is different, with different metrics and all that sort of stuff. Are you reaching into the warehouse and trying to enable the creation of that value? I mean, tons of companies are doing it with dbt. But in many ways, you need to have something to send that isn't there when the data arrives from Fivetran.
Boris Jabes 27:15
Yeah, this might be my favorite question and topic and thing to think about. You have to generate some kind of IP.
Eric Dodds 27:27
That’s a more succinct way to say it.
Boris Jabes 27:30
Yeah. So I think a company has two kinds of IP. There is the widget that you make, and there is how you sell it, market it, and support it. Those are both a kind of IP, right? Our industry focuses like 99% on how to make better widgets, and on how the source code is your ultimate IP, and all these things. And I think all of this, call it how the sausage is made, how it's sold, how it's supported, how it's marketed, is absolutely IP. And if you have none, if the way you send an email about promotions for your shopping cart can be solved by, you know, your Stripe automatic shopping cart reminder checkbox (I don't know if they have that, but let's say they did), then great, you don't need any of these things, right? You have no IP of your own. So I guess that puts the onus a little bit on companies to actually think about what makes them unique. But here's what's happening, and has been happening for years now. I think your point about Google Analytics being all encapsulated is actually a really good metaphor for this entire modern data stack. We tend to think about the modern data stack as all these various tools and phases, right? The data comes in, and it's transformed, and all these things. But in a way, the modern data stack is taking every single SaaS app and, you know, turning it on its side. Google Analytics ingests data, stores data, visualizes data, allows you to query the data, and reports on the data. It models the data, right? It has everything in the app. And then repeat that times thousands of applications. As long as everything you need can be done inside that silo, those products are great. And what the modern data stack does, in some ways, is just reinvent that.
It's like, well, now we can ingest all applications into one single storage layer, okay? You can store everything in one place, and you can visualize it all in one way. So is that a useful architecture versus, you know, 30 apps that each implement their own end-to-end data store? I think the key question there is: does your IP involve joining data? If it doesn't, then you could potentially throw this entire modern data stack out, right? You could say, "We have a billing system, and all of our information about how much money we made is in the billing system." Then all that matters is the question: does the billing system give me an interface where I can render, visualize, and query? If it doesn't, then of course you need to pull the data out so you can query it. But see, this is, I think, the transition. Once upon a time, people were pulling data out into their database, their data warehouse, because you couldn't query Stripe using SQL. But that's going to change. All of them are going to improve how they make their data queryable. What you can never do, though, from inside Stripe or Google Analytics, is join and query data across sources. That's not possible. And that is what the data warehouse in the data stack uniquely does. So then, is there insight for your various teams that comes from joining data together? Well, in the real world, always, right? Take your sales example or your marketing email. You could tie either of those to product activity. Well, that's one source of data, and that's assuming your entire product is one database, which it almost never is nowadays; it could be multiple services. It's also going to be tied to financial information about that customer, which comes from, well, some kind of invoicing data, which might be one billing system or might be multiple.
It's going to be tied to their level of engagement with your team, so your support data is getting joined into that as well. And that's just me rattling these off, right? I bet you the best companies have really interesting ways of modeling their users, their customers, their value, whether that's to forecast it, or to automate it, or whatever. So I think the long answer is yes. When you use Census, the goal is not to just take something from Fivetran into your warehouse and then send it back out with no intermediate step. If you do that, then you're just getting the base value, which is "I can take something from one app and put it into another app," which is still good, right? Take a Zendesk metric, dump it into a warehouse, and then take it from the warehouse and put it into Salesforce. That's still something, and I actually think it's a better architecture than connecting those apps directly, because at least you have a hub. But if you're just setting up a pipeline that's raw to raw, then, yeah, your job is not that interesting. The real value, and the reason we employ data teams, is that they're actually sitting there going, "I think I could take these disparate pieces of information and clean them, distill them, merge them, and come up with new, valuable insight."
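The kind of join Boris is describing, where product activity, invoicing, and support data are combined into one view per customer, might look something like the sketch below. It uses Python's built-in `sqlite3` module as a stand-in for a data warehouse; the table names, columns, and sample rows are all invented for illustration and do not come from the conversation.

```python
import sqlite3

# In-memory SQLite database standing in for the warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE product_events (account_id TEXT, event TEXT);
    CREATE TABLE invoices (account_id TEXT, amount_usd REAL);
    CREATE TABLE tickets (account_id TEXT, status TEXT);

    INSERT INTO product_events VALUES
        ('acme', 'login'), ('acme', 'export'), ('globex', 'login');
    INSERT INTO invoices VALUES
        ('acme', 500.0), ('acme', 250.0), ('globex', 100.0);
    INSERT INTO tickets VALUES
        ('acme', 'open'), ('globex', 'closed');
    """
)

# Aggregate each source on its own first, then join, so that rows from one
# source do not multiply the counts or sums of another.
ACCOUNT_MODEL_SQL = """
WITH activity AS (
    SELECT account_id, COUNT(*) AS events
    FROM product_events GROUP BY account_id
),
billing AS (
    SELECT account_id, SUM(amount_usd) AS revenue_usd
    FROM invoices GROUP BY account_id
),
support AS (
    SELECT account_id, SUM(status = 'open') AS open_tickets
    FROM tickets GROUP BY account_id
)
SELECT a.account_id,
       a.events,
       COALESCE(b.revenue_usd, 0) AS revenue_usd,
       COALESCE(s.open_tickets, 0) AS open_tickets
FROM activity AS a
LEFT JOIN billing AS b ON b.account_id = a.account_id
LEFT JOIN support AS s ON s.account_id = a.account_id
ORDER BY a.account_id
"""

account_model = conn.execute(ACCOUNT_MODEL_SQL).fetchall()
for row in account_model:
    print(row)
```

No single source application could produce this table on its own, which is the point being made: the per-account row only exists once the three sources sit in one queryable place.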
Eric Dodds 32:58
One quick follow-up question, because I want to leave enough time for Kostas to ask about the term “data federation.” He and I talk about that all the time, and he has some really interesting thoughts. I love the paradigm of IP. What are the ways that you see companies creating that? The context behind the question is that some of the most interesting ways I see it happening are through tools like dbt, where you’re creating interesting models. Of course, I think there are a lot of companies, tons without a doubt, who maybe just write SQL on the warehouse to perform the joins. What else are you seeing? How are companies creating IP? Is there anything interesting in the way that IP is being generated in the context of those joins?
Boris Jabes 33:47
So I think it’s always helpful to step back and remember that we are very, very deep in the most cutting-edge, sophisticated companies. To your point, Google Analytics is still so widely deployed. The majority of this does not happen in dbt; it does not happen in all these places. But there is business logic everywhere. There’s the query that you wrote ad hoc in your database. And if we’re being really honest, probably the largest repository of this kind of logic, this kind of query, is not in dbt and GitHub, although I think that’s what’s great there: it’s starting to become a better repository for this, and I really hope our entire industry moves towards that model. It’s probably, and don’t freak out, in Salesforce SOQL queries and Apex code.
Eric Dodds 34:49
I agree with you wholeheartedly.
Boris Jabes 34:52
I think if we think about the sophistication stages, right, crossing the chasm, et cetera, Silicon Valley and, broadly speaking, software companies have moved to this new paradigm now because their most important signals come from their software, and your CRM doesn’t store that. So the data warehouse is the perfect query engine, storage, and computation layer for that information. I don’t even know how many events the average software company generates now, but it’s a lot, and that is why we store these things there. But think of non-software companies. Eventually everyone will be a software company, which is why we all skate to where the puck is going, but there are still furniture companies in the world, right? You would probably find that the bulk of the intelligence, the IP that I’m talking about, lives glommed onto their Salesforce instance, in a collection of maybe checked-in, probably not checked-in, code. Sometimes it looks like queries, since Salesforce has a query language called SOQL, and sometimes it’s more imperative code, like Apex. The real goal of Census is to move that into a Git-backed, open, standard language called SQL. That’s, I think, the journey we’re going to see, but it’ll take easily a decade plus. All of us in our industry, and it’s why we’re so exuberant and why we raise all this capital, think these things happen much faster than they do. I started my first company about 10 years ago on a very simple premise: if we’re all going to live in SaaS, you need to have your employee identity, your password, your login, centralized, right?
It seems to make sense. You can’t have 8,000 unchecked passwords in a company; that doesn’t work. It’s been over a decade, and we’re still in the infancy of that market. That’s how long these things take. And so I think in data we’re very much in that early stage.
Eric Dodds 37:09
Back when I was doing consulting, we used to joke about companies of all types and sizes: I’ve never seen a Salesforce that’s not some sort of Frankenstein. It’s easy to talk down to that, because it does create real pain. But in reality, it’s pretty advanced for a lot of the companies doing it, and it enables them to accomplish things. What else can they do? I mean, of course, the modern data stack, but it is very helpful, and it is pretty advanced, to be able to customize all this business logic inside the tool. So that’s such a helpful perspective.
Boris Jabes 37:53
Yeah. And I think there’s going to be this interesting cascade. The data community still has so much to distill from the world of software engineering down into, let’s call it, the broader world of data, and it’s exciting; that’s why a lot of us work in this space. We’re still in the early days of everyone realizing that you can treat your queries as a piece of code that can be versioned, right? We’re still at the beginning of that. And then there are going to be all the other things that go around the software development lifecycle for data, and even there, we have to get quite a bit more sophisticated if we’re going to support these kinds of workflows. So I’ll give you an example, because the cascade goes from software engineering to data organizations, and then down to business organizations. Think of that Salesforce you saw in your consulting days. Everyone always says it’s a mess, it’s got all sorts of stuff, there’s a field called blah_blah_blah_2. But how many people in the modern data stack actually run something equivalent to a migration when their data schemas change? Very few, if any. So we still have to get more sophisticated in how we manage data in the core, let’s call it. But as we do, I think a lot of that will be able to have this amazing downstream effect on the rest of the business.
Kostas Pardalis 39:30
You really made me think, Boris, with the comment you made about Salesforce and the business logic there, because you reminded me of something extremely painful: whether and how you can replicate the results of Salesforce formulas in the warehouse. I don’t know if Fivetran is doing it today or if they figured out how to do it, but it’s pretty much impossible, because the formula is a piece of logic that is executed whenever you make an API call.
Boris Jabes 40:08
That’s a beautiful microcosm, by the way, of this whole thing. You’re absolutely right. You’re absolutely right.
Kostas Pardalis 40:14
Yeah, and I think that’s what justifies this category of reverse ETL, or whatever we want to call it, and makes it important. At the end, you might be able to export the data from Salesforce, but the business logic is not something you can export; you need someone to replicate it, which is a completely different story. So you need to get the data out, but that’s not enough. You also need to push whatever you do with this data back again, right? Many times I say, you can ask many things of a salesperson, but you cannot ask them to leave Salesforce. That’s where they live. They don’t care about that stuff. The only thing they care about is the quota, and that’s what they should do. Why should they care about whatever shiny technology we have? It’s like with engineers.
Boris Jabes 41:12
Kostas, but there’s versioning, man. It’s awesome.
Kostas Pardalis 41:14
Ah, yeah, sure, sure.
Boris Jabes 41:22
No, absolutely. What’s the term? People live in their pane of glass, right? And you just can’t get them out of there.
Kostas Pardalis 41:32
And I think there were some attempts to do that with stuff like Looker, for example. There’s this vision of BI tools where it’s like, yeah, ask your salespeople to go and work from within Looker, and then there will be links to go back to Salesforce. Like, no. Why?
Boris Jabes 41:50
You know who suffers from this the most? It’s actually tech founders in the Valley. They start their company and they’re like, “Yeah, I got Looker. My salespeople are just going to go there.” They’re deluded because they see this as easy, because you and I can do it. And I’m like, “No, they’re not, man. I promise you, they’re not.” And he’s like, “It’s easy. For sure they’re going to do that. I can do it.” And I’m like, “Uh-huh.” Sometimes it takes years for them to realize, “Oh, yeah, I hired a VP of sales, and they ended up doing their own thing.” And I’m like, “Uh-huh.” So yeah, I think tech founders in particular suffer from not seeing this.
Kostas Pardalis 42:31
Yeah. Because it’s also extremely easy to burn money. Actually, it’s like one of the reasons you exist. Spending 50 grand to buy a license for the experience, right? Anyway, that’s another very interesting conversation that we need to have at some point. But that was a very interesting point you made there. Now I want to go to the term “federation” that Eric mentioned. Traditionally, for an engineer, federation means something completely different, actually the opposite. When you’re talking about federation, it’s more like: I’m not going to collect the data into one place; I’m going to query each data source, and then I will federate the results and present them. So what’s your thinking on this as a solution? Unless you have a different definition, which I would be more than happy to discuss. Where do we stand today, and where do you see it going? Because today, technically speaking, this is not federation.
Boris Jabes 43:33
No, no, no. I think that’s a very reasonable technical pushback. So let me start with an analogy I tend to use with my team. You’re going to appreciate it because I think you’re close enough in age to me, but I’m starting to notice that younger people are like, “What is he talking about?” Your laptop, your computer, has an operating system in it, and it provides a lot of things for you, the user, and for the applications that are built on it. I think that when we moved to the web, we gained a lot, but we lost a few things along the way. One is login. You log in once, and then you don’t open Word and see “please log in,” and you don’t open Photoshop and see “please log in.” With the caveat that everyone now has a web app, so that’s different now, but let’s put a pin in that. Your user identity was just given by the operating system to all the other applications. They were just receivers of that knowledge and used it. In the same way, there’s a file system in your operating system. When you open a file in Word and then open it in Excel, it’s the same file; they don’t both have to implement a file system to be able to read and write data. So I think when we moved to the web, we lost both of these things. And funnily enough, both companies that I’ve started are solving these two things.
So when I think of data federation, the reason I use that term is this. In order to have a wealth of SaaS applications exist, which is what I want, you’re always going to hit this natural friction around replicating the data correctly and consistently, because it’s a distributed system, and the applications all want to speak about the same things.
So the more apps you have that all speak roughly about the same things, the more you’re going to have master data management problems and all the things that you, as a distributed-systems-minded software engineer, can think through. They’re hard, and it only gets worse with every n-plus-one application you want to use. I think there are only two ways this gets resolved in the long run. One is the one I don’t want, which is that everything gets progressively acquired by larger companies, because then they can create the tight integration. They can create the tight integration between Slack and Salesforce, and I’m sure they will. Maybe it’s because I started my career at Microsoft that I saw this, because Microsoft is basically the best company in history at doing this. It has built unbelievably great technology for interoperation between its applications. They can do this because they can work together; they can force Excel to do something that Word will then also abide by. So that’s one option, and we see these pressures more as we get into the later stages of SaaS, which is now in year 20. The only alternative I can think of is that, for the things where you need consistency, you come up with a different model than just independently replicating the data in bespoke ways in every application. That’s why I use the term data federation. I believe that, as a company, if you want to use the maximum number of SaaS applications with the most freedom and not be tied to one vendor, you want to be able to own your data and then seamlessly have it be usable in any application. So today, my only option to enable that world for people is to say, okay, what is the place? Let’s work from first principles, right?
Well, you need to store all the data in a way that is the most cost-effective and scalable. That’s a data warehouse or S3, right? Either raw storage or a data warehouse; those are the best tools we have. If something better came along, I’d take it, but right now that is what’s best. And then I want a seamless ability to use that data from any application. If I could eliminate the data pipelines and just say your app is built directly off the data, that’d be great. But because of the way OLAP warehouses are designed, and because of the incentive structures in the market today, you don’t get that. There are tools, by the way. Salesforce is not the only one, but they have this concept called external objects, where you can have an externally backed data store. But it’s slow, and you don’t get all the features: you don’t get the formulas, you don’t get the indexes, you don’t get all the things. So that’s what Census does: we push the data into the internal file system of each of those products, thus turning them into a kind of high-performance cache on a single data store. That’s what I mean by data federation.
Kostas Pardalis 48:39
Yeah, makes total sense. I have a question that is a product question, but it has more to do with the experience of building a product. From when you first launched until today, what have you learned by building this product?
Boris Jabes 49:07
Great question. I would say that I’ve learned the most about our users and data teams as a whole. It’s been really fun to watch them on this journey over the last three years. I’ve talked about this before, I think, but it really is the thing that always comes to mind. The first reaction I had when we started selling this to users was, “Hey, great. This is going to save me time,” or “This allows me to do the thing that I didn’t know how to build. I don’t know how to write this kind of connector. I write SQL; I don’t know how to write Python.” That was the initial experience we had, and it was not surprising. It was not something where I was like, “Ah, what a discovery.” But then, and we talked a bit about this, something became very visceral to me. After a little while, especially with the early users of our software, though now it happens more often, I started seeing a very unusual reaction from our users that actually caused me real pain. I was worried. I was really like, are we screwing up here? This seems bad. These are not the words you want to hear from a user, right? You want to hear excitement, power, enjoyment. And multiple customers started using, effectively, expressions of fear. They started genuinely saying “I’m scared,” in so many words. One customer was like, “I feel like I’m holding a machine gun.” And I’m like, well, that’s not the feeling I want to engender in you. I could have shied away from that; I could have been really freaked out. But I started to think about it, and what I realized is that Census, and this is what I mean by it’s not just a data pipeline, is giving these users a power they’ve never had before.
Right, the power to do analysis is not new. It’s massively improved with great tools, but the ability to analyze data is something they have always had. The ability, from your vantage point in the data organization, to cause a marketing email to get sent, or to cause a salesperson to wake up in the morning with a task to call this person, did not exist before Census. And of course it’s scary. Now it’s your fault if something breaks. Actually, breaking would be ideal. If Census said, “Hey, sorry, the pipeline can’t go today,” that’s bad, but nowhere near the worst-case scenario. The worst-case scenario is you push bad data, extra data, data that is going to be embarrassing when it goes out. That was the emotion they were trying to convey to me. So now I spend a lot of time thinking about how we can build capability in Census that improves your confidence. We have a lot of experience in the world of software on how to be agile but safe: code reviews, testing, unit testing, just decades and decades of research and practical experience on doing this. So I think one of Census’s jobs to be done, in the software and in our marketing, our education, and our content, is to teach how to make this less scary, but also to embrace a little bit of fear. I don’t want people to go back to, “I’m only going to press the go button once a year because I don’t want to break things.” So that’s probably the biggest thing I’ve learned: the biggest hindrance to deploying Census is helping people overcome this new responsibility and the fear that comes with it. On the other side is so much power, so much growth, so much more your team will be able to do, so you should embrace it. But it is genuinely scary.
And that’s a first in my life, to build a product that freaks people out.
Kostas Pardalis 53:15
It’s a good problem to have because, of course, I think it’s an indication of the value.
Boris Jabes 53:24
I’ll give you an example of how this manifests. Speaking of product, and our marketing team is going to hate me, I think this is not solved with one giant whiz-bang feature that you can announce. It’s a collection of very fine-grained thinking, small features here and there. So here’s an example. There are a lot of products where, when you write into them, and to your point, reading and writing are very different, the behavior is not fully defined. In compilers, you know, there’s defined behavior, there’s undefined behavior, and then there’s unspecified behavior, which is actually a different thing. It just means it’ll work, but I can’t tell you what’s going to happen. When you write duplicates into some systems, not all, and that’s the beauty of it, since we support something like 50 different applications with different behaviors, some of them will behave in very unusual ways. Some of them will reject the duplicates; some of them will just pick one, and you won’t know which one. When we built the very first version of Census all those years ago, our goal was speed. It was, let’s take the table and get it into the destination as efficiently as possible. And we didn’t know that, it turns out, people have plenty of duplicates, because the warehouse is not enforcing unique IDs. So they were sending duplicates, and we were powering through. We were super fast: yay, go sync millions of duplicates, no problem. And then you’re back to the same old problem, where the sales team or the support team or the success team or the marketing team says, “I don’t know, this data is wrong. Screw the data team, let’s go back to doing our own thing.”
So now we’ve added that capability. It’s built in, and you can’t turn it off: we block duplicates from being synced. There are even some people who are frustrated by this, because it shows up as errors and they’re like, “That’s not an error.” But we’re going to treat it as an error, because you’re not realizing that this has annoying downstream effects on your team. So it’s a million things like that that we’ve had to invest in.
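To make the duplicate-blocking idea concrete, here is a minimal sketch of a pre-sync check. It is an assumption about how such a guard could look, not a description of how Census implements it: the function name `partition_by_unique_key`, the `email` key, and the sample rows are all hypothetical.

```python
from collections import Counter


def partition_by_unique_key(rows, key):
    """Split rows into (syncable, blocked).

    Every row whose key value appears more than once is blocked, rather than
    silently picking one, because the destination's behavior on duplicates
    may be unspecified.
    """
    counts = Counter(row[key] for row in rows)
    syncable = [row for row in rows if counts[row[key]] == 1]
    blocked = [row for row in rows if counts[row[key]] > 1]
    return syncable, blocked


if __name__ == "__main__":
    # Hypothetical warehouse rows; the warehouse did not enforce unique emails.
    rows = [
        {"email": "a@example.com", "plan": "pro"},
        {"email": "b@example.com", "plan": "free"},
        {"email": "a@example.com", "plan": "free"},
    ]
    syncable, blocked = partition_by_unique_key(rows, "email")
    print("sync:", syncable)
    print("report as errors:", blocked)
```

Reporting the blocked rows as errors, instead of guessing which copy is correct, keeps the ambiguity visible to the data team before it reaches sales or marketing tools.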
Kostas Pardalis 55:30
Yeah, I totally get that. I think there are two things that people don’t realize when they start using products like Census. One is that the Census team has to learn to work with technology that is completely a black box. You have Salesforce on the other side. It’s very interesting: I remember talking with one of the lead engineers at Salesforce about something they were building when they acquired Sure. And the guy was like, “We were building this thing, and there were cases that we couldn’t predict even being inside Salesforce. There were edge cases that we couldn’t replicate, even with access to the whole infrastructure and knowledge that Salesforce itself has.” So imagine now that you have Boris and his team, and they’re trying to interoperate with Marketo, I don’t know.
Boris Jabes 56:39
Is there is there an off-the-record version of this podcast?
Kostas Pardalis 56:46
I mean, it’s one of the dominant marketing platforms, and for good reason, I’m sure. But interoperating with it is a completely different thing. There are errors that are not documented, and there are behaviors that are not documented, for a very good reason: all these APIs were not built for Boris and his team to write to. They have a completely different specification. That’s one thing that people tend to forget. The other thing they tend to forget is that as we add more and more systems into the stack, or architecture, or whatever, we are actually building a super complicated distributed system. And distributed systems have some very specific rules, like delivery semantics, which might sound very theoretical but are actually very, very practical. I don’t expect anyone in sales to know that one of the ways we can deal with that is to have at-least-once delivery semantics. I mean, that doesn’t work in the end.
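For readers unfamiliar with the term, here is a minimal sketch of at-least-once delivery: the sender retries until it sees an acknowledgement, so the receiver may get the same record more than once and has to apply it idempotently. The `Destination` class, the `send_at_least_once` function, and the simulated lost acknowledgement are hypothetical, chosen only to make the behavior deterministic.

```python
class Destination:
    """Hypothetical receiver that stores one payload per record id."""

    def __init__(self):
        self.records = {}
        self.deliveries = 0

    def upsert(self, record_id, payload):
        self.deliveries += 1
        # Idempotent write: delivering the same record twice leaves the same state.
        self.records[record_id] = payload


def send_at_least_once(dest, record_id, payload,
                       ack_lost_on_attempts=(), max_attempts=5):
    """Retry until an acknowledgement arrives; return the number of attempts.

    `ack_lost_on_attempts` simulates attempts where the write succeeded but
    the acknowledgement never reached the sender.
    """
    for attempt in range(1, max_attempts + 1):
        dest.upsert(record_id, payload)
        if attempt not in ack_lost_on_attempts:
            return attempt  # acknowledgement received
    raise RuntimeError("no acknowledgement after %d attempts" % max_attempts)


if __name__ == "__main__":
    dest = Destination()
    attempts = send_at_least_once(
        dest, "user-1", {"page_views": 3}, ack_lost_on_attempts={1}
    )
    print(attempts, dest.deliveries, dest.records)
```

The record is delivered twice but the destination ends up with one copy, which is why at-least-once delivery is usually paired with idempotent writes keyed on a stable identifier.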
Boris Jabes 57:53
Kostas, I’m getting PTSD. I remember using the term “eventual consistency” in front of a marketing team, and they were like, “No, no, we need it. We can’t have it be eventual.” And I’m like, “It has to be eventual. Like, it’s not negotiable.” And then, “Oh, that’s what you mean.” Because in their brain, “eventual” means it’ll show up tomorrow. And I was like, wow, I forgot that this is a term we use in the industry that has nowhere near the same meaning to them. Also, it goes both ways: “real-time” doesn’t mean real-time to a lot of people.
Kostas Pardalis 58:25
Oh, yeah, same. The reason I’m saying that is because I think there’s a very important element here, one that all of us in this market are responsible for, and that’s education. It might sound a little exaggerated, but part of the product is also education. We need to help people outside the building understand what they can do and how they can do it with the technology, because there are limits. Engineering is about trade-offs, and we have to make these trade-offs; otherwise we’re not going to have products that work.
Boris Jabes 58:58
Yep. I think interesting products tend to have this educational component, and I wholeheartedly agree that it’s part of the journey we’re all on. Especially because, again, the world is large, and one of the things I have learned is that the world is nowhere near as sophisticated as people think it is. I tell people this: even Silicon Valley is not as sophisticated as you think it is. And you and I work with some of the best, right? I remember I used to do really fancy demos in the early days. I would try to drop in words like AI, just to say, yeah, you need all these things. And then one day, out of expedience, because I didn’t have time that day, I did the dumbest version of the Census demo. This was back in 2018 or 2019. There were like two metrics you could set up in 12 seconds, like count page views. And I was like, let’s put that in Salesforce for a customer success team to know how many times a customer visited your product. That was it. That was the demo. And I was actually concerned and embarrassed at first, because they were in awe. People were like, “This is the greatest thing since sliced bread.” And I was like, this is the basics. This is not the wild demo. Why are you guys wowing? You forget how starved people are for this, right? And you’re right, it goes hand in hand: then you start delivering stuff, and then we’re going to have to do a book like “distributed systems for regular people,” because I think it’s too intuitive for you and me. We know it so well that we take it for granted.
And then you end up in these weird miscommunications. I think the need to educate is doubly important. You are right that we need to educate just to serve our own users. Think of what Fishtown, with dbt, has to do: just teaching the concept of version control is unbelievably valuable. And if I think about what we’re doing, we’re turning the data team into a kind of company platform team. So we need to help them explain what’s happening to everybody else; otherwise they will also fail. We have to act as their advocates to the rest of the company, and that’s super essential. So you’re right, education is unbelievably important. Hopefully these conversations help.
Eric Dodds 1:01:39
This is great. Boris, this has been such a fun conversation. Brooks actually let us run a little bit long, which is super fun when we get permission to do that, but we’re at time here. This has been really helpful for me, and I think definitely for our listeners as well. So thanks for the time.
Boris Jabes 1:01:56
I mean, thank you, thanks for having me.
Eric Dodds 1:01:58
First of all, I have to say that Boris is so articulate. I find myself jealous of his ability to explain complex things, and even dip into the world of formal computer science, in a way that’s so accessible. So I appreciated that and learned a ton from him. My takeaway is around the way he described the value that’s created in the warehouse as it relates to data that’s transformed for downstream tools. He described any data that needs to be joined in order to produce some sort of valuable asset as IP, which I think is such a helpful way to frame whatever kind of value we’re creating in the warehouse, whether it’s a unified customer profile or packaging some sort of analytical component from one business unit and sharing it with another. So I really appreciated that. It’s been helpful for me to think through.
Kostas Pardalis 1:03:05
Yeah. I mean, it was an amazing conversation, I think. In general, there are many insights for someone to take from it. What I’m keeping is that I really liked how he’s using the term federation. This is something we also discussed during the episode. Traditionally, federation has a different meaning, but the way he’s using the term makes a lot of sense, and that was very interesting. It was also super interesting to discuss with him all the challenges around building a product like this. So hopefully we’ll have him back in the future and have more stuff to talk about.
Eric Dodds 1:03:51
Absolutely. All right. Well, thanks again for joining us on the show. Lots of great episodes coming up. So we’ll catch you on the next one.
We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.