Episode 137:

Data Collection Secrets & The Search Data Problem with Josh Wills

May 10, 2023

This week on The Data Stack Show, Eric chats with Josh Wills, an experienced data scientist with work at places like IBM, Google, Slack, DuckDB, and others. During this conversation, Josh shares his journey working at these large companies including work in data engineering, data science, and other fields. Eric and Josh also discuss high-quality data and the process to get it, auction code, usage patterns and complexities in search, and more.

Notes:

Highlights from this week’s conversation include:

  • Josh’s background in data working at Google, Slack, and other companies (1:21)
  • The need and process for high quality data (4:33)
  • Digging into auction code (14:03)
  • Joining Slack and working in the early days of the company (18:00)
  • Not fighting the last war in data (25:42)
  • Building a product, while using the product (30:35)
  • Transitioning to the search team at Slack (36:50)
  • Usage patterns of search (41:21)
  • Josh’s work in helping build DuckDB (46:20)
  • Having the right toolset to increase precision and efficiency (52:42)
  • Final thoughts and takeaways (56:03)

 

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 00:03
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.

Welcome back to The Data Stack Show. Today I’m actually flying solo; Kostas is caught in an airport and cannot get to a quiet place to record. So I am going to chat with Josh Wills. He has an amazing story. He worked at Google, he worked at Slack, and he started his career as an analyst and then got deeper and deeper into data science and data engineering. So I want to try to hit on all those topics. He is also a regular contributor to DuckDB, and we’ll sneak that in if we can. I am most interested in hearing about lessons learned from Josh, because he was at Google at a really interesting time and at Slack at a really interesting time. He has experienced hyperscale and being the first person to build out a data team and data infrastructure in several different contexts, so I think there’s a ton that we can learn from him. And then hopefully we can nerd out too. He’s a brilliant guy and uses every tool under the sun, so hopefully we can dig into the technical stuff. So let’s dive in and talk with Josh Wills.

Josh, welcome to The Data Stack Show. So many things to talk about, so thanks for giving us some of your time.

Hey, Eric, my pleasure. Thanks so much for having me.

You’ve done so many things with data. Before the show, you described it as sort of, you know, the villainous descent deeper into the stack, which may end with you creating silicon wafers. But give us a quick recap of the villainous descent.

Oh, yeah.

Josh Wills 02:01
That’s a great question, man. So I think I started out in life, I suppose, as a math major in college. And I thought I was gonna go be a math professor or something like that. But it turned out that I was good at math, but I wasn’t that good at math. You know what I mean? I wasn’t, like, math-professor good at math. So unfortunately, I had to, you know, get a job as a software engineer and stuff like that, which was, again, fun. I liked computers, and I liked data. I like to analyze things, right? So I started as

Eric Dodds 02:28
Which discipline did you study? Just out of curiosity?

Josh Wills 02:32
Oh, I mean, I did effectively a survey of math. When I got to college, I really wanted to be an analytic number theorist, like Ramanujan. Basically, to prove the Riemann hypothesis was kind of my goal. I think, Eric, yeah, we’re going to talk about my career today and all this interesting stuff I’ve done, but I basically think of myself as a failed mathematician, more or less. My mental model of who I am is, you know, math guy who can’t hack it. That’s basically me. But anyway, I took statistics in college, and I liked it, and it sort of appealed to me. And so my first job out of college was working at IBM, and I was analyzing wafer processing data. So I was working for those very hardware engineers we were talking about before the show. I was analyzing the manufacturing process, basically: is this chip going to work? How fast can we run it, I mean, in megahertz and stuff like that, or gigahertz, whatever. And, you know, I did that for a while, and I did that at various places. I don’t know exactly at what point I realized it; it maybe slowly dawned on me over the course of five or 10 years that the key to really good analysis was really good data. I could do all the fancy, crazy statistics I wanted to compensate for the deficiencies of the data, but really, like 99% of the time, I was better off getting better data. That was the whole thing. And that’s my villain origin story. I love Alice Waters, the chef who has this, you know, high quality ingredients, simply prepared; that’s the California lifestyle of food, right? And that’s very much been my approach to data analysis: high quality data, simply prepared. That’s good analysis.
And that was a big switch for me. Again, when I was a kid in school and doing all this fancy math stuff, I was like, oh, yeah, you know, we clearly need to use the Lebesgue integral here to solve, you know, just sort of nonsense like this, right? Just to show off how smart and technical I was, at least in a stats-y kind of way. But really, yeah, that didn’t end up yielding the outcomes I was after, versus high quality data.

Eric Dodds 04:50
When you think about the data: you’ve done so many data things at really large companies and some small companies. When you look at this space today, how much effort do you think is going into the data, like getting high quality data, versus actually doing the math? Because we hear about doing the math more, because that’s flashy. But I suspect it’s a huge amount of the expenditure, both in time and money.

Josh Wills 05:15
I mean, I completely agree. You know, like, it’s, it isn’t sexy. It’s not like, whatever. But all? Um, that’s a great question. It’s a great question. Is it like this? I don’t know.

Eric Dodds 05:27
I mean, 80/20? I don’t know. I mean, I’m not asking you to put a percentage on it. But it’s just

Josh Wills 05:33
like, what I’m trying to say here, I think, is there’s the data we get that’s relatively easy to get, right, which is the data we have easy access to. So it’s like, I hook up Fivetran, and it pumps my data from my production database into my data warehouse, and I analyze it, right? And then maybe I have some logs data, maybe I hook up Segment, maybe I hook up, you know, RudderStack, maybe I hook up Snowplow, all these kinds of things. I’m trying to be fair to all the vendors in the space,

Eric Dodds 06:02
or whatever, like there. Yeah,

Josh Wills 06:03
Exactly. And then I analyze the shit out of that data. But then how often am I going in and saying, you know, if I had this additional bit of information that is available during the context of this request, but isn’t being logged right now, it’s not being stored anywhere, if I could grab it and introduce it here, I’d be able to do way more powerful stuff? And then it’s kind of crickets at that point, right? No one’s talking about that kind of stuff. How are you bridging that gap? And I think the secret of my success, I guess, as a sort of data engineering person is that I am the data engineer who is more than happy to go into the front end code base and get the field I want and send the PR to the front end engineer, and get it merged and get it piped through the rest of my system so that my analysts can go do data with it. And that’s just what I do. I don’t know why more people don’t do this, I guess. Is it intimidating? Is it scary, and the front end engineer’s gonna be mean to you and tell you your code’s not very good? I don’t care. I’m not a software engineer. I’m a data scientist. I don’t identify as a software engineer. Is my code crappy? Yep, probably. I don’t give a shit. I want my data. I’ve been very fortunate in my life to do a bad job at lots and lots of different things and have many more competent engineers come along and fix it for me once it proved to be useful, you know. But yeah, I think it’s that last part. We talk a lot about the tools. We talk a lot about the databases, the visualizations, the methodology. We talk about how we collect data. We don’t talk about getting the additional data, going above and beyond. That’s what we don’t talk about. And that bums me out.
Because that to me is the, you know, quiet, thankless, but ultimately heroic work that makes the real difference, I think, in how well we do and the impact that we can have as data people. Yeah, yeah. 100%. Okay.

Eric Dodds 08:04
So you realize, as an analyst, you know, I need to spend more time on data quality, because, you know, I can do the math, but really, yeah, but it is a big deal. So

Josh Wills 08:18
Yeah, well, data collection. To me, data people are like, there’s governance, there’s quality, there’s provenance. Me, I literally know what this data means because I collected it myself, right? I am out there foraging in the fields and the forest for the mushrooms that are going into the mushroom soup, the dish I’m preparing. Right? That’s my approach to data collection and data analysis. There’s no governance thing. There’s no MIS, there’s no game of telephone between me and the data, all this kind of stuff. It’s just, literally, I collected this. I know it’s good. I know what it’s supposed to look like, because I went and read the code that generated it or wrote the code that generated it, right? That to me is what makes the difference. Yeah, that’s kind of what I’m talking about. And again, in terms of people who do this, it is pretty much just me, as far as I’m concerned. There may be other people who do it, but I haven’t run into that many of them, you know. So when you’re

Eric Dodds 09:15
following that path, where does it lead you? Okay, so you leave the analyst world. Were you sort of that person, like, I’m crossing frontiers as a data engineer and getting all this data? What was your next step?

Josh Wills 09:28
I think it’s an interesting question. I mean, so it was Google. A couple different things. When I was just an analyst, at IBM and stuff, I was not that person yet. I kind of became that person at Google. Google hired me as a data analyst, like a statistician. That was my career ladder there. But then over time, I transitioned over to the software engineering ladder, just because the leverage I had as a person who was willing to get into the ad auction code itself and just literally grab the data I wanted and make it available to everybody on the team was so much higher than me doing any one-off analysis, right? Just by far, obviously. So I made that transition there. When I left Google, I went to Cloudera. At the time, it was kind of unfortunate: like a lot of other people, I had been roped into working on what eventually became Google Plus, Google’s kind of Facebook-killer social thing. And it was, for any number of reasons, just an absolute nightmare project to work on. So I quit and went to work at Cloudera, because I loved the tools that we had at Google for working with large datasets, and I wanted everybody else to be able to use those tools as well for all kinds of different problems. And so I went to do that at Cloudera. I still talked about how we did this stuff at Google, but it really was a period of time where I was very much away from data collection, and I was building tools and stuff, right? And then going to Slack brought me back to, okay, I need to build a data collection system that is akin to what we had at Google, so that people like me can do the kind of work that I did there, which is go instrument, collect high quality, immutable copies of every single thing that happens, and do all this stuff with them downstream.
And so that was what brought me back to that work, to being militantly focused on data collection and getting high quality data from the source and all that kind of stuff. So yeah, that’s roughly that one.

Eric Dodds 11:25
Yeah, yeah. So I want to zero in on a couple of those steps. So yeah, at Google. So you started as an analyst, when you you know, you talked about going into the ad auction data? Yep. Was that political at all? Like, no.

Josh Wills 11:43
Okay. To Google’s credit, it was super not. It was super nice. If you were in a technical role at Google, like an engineer or a statistician, I think we all did code reviews, we all used the same source control system. We would source control and review our analysis scripts, we would source control and review our code. The thing I think it’s important to know is, the reason analysts have a hard time getting engineers to do this kind of work, of just, add this field and copy it from this record to this record, is that this is not, you know, high-status engineering work, right? No one’s gonna see this. It’s not doing some badass algorithmic backend thing. You’re not going to talk about this at a conference. It’s not sexy work. But to an analyst, this is everything. This is oxygen. As an analyst, I am thrilled to do this work. I am thrilled to go copy the stupid field from, you know, protocol buffer A to protocol buffer B. Nothing makes me happier. Yep. And so yeah, it absolutely wasn’t political. What was also funny, I think, was that Google’s auction code was just such a disaster back when I did it. Some parts of Google’s code base are pristine, elegant, gorgeous, and other parts are just, you know, absolute toxic waste dumps, right? Yeah. And the auction itself was one of them. And so that was another reason why no one really stopped me. Because it was like, yeah, I mean, if you want to go wander in there, sure. Good luck.

Eric Dodds 13:22
Basically, the House of Horrors.

Josh Wills 13:25
Yeah, exactly. Exactly. And again, not high status work. But by doing it, I became the expert on how the auction actually works. How did things actually work? How did all the various pieces interact with each other to ultimately determine the position and price of an ad on google.com? And that turned out to be a very valuable thing, because if something went wrong, I was the one who could tell you, okay, these are all the ads we placed incorrectly, and this is the compensation we need to do, and this is how we prevent it from happening again, and all that kind of stuff. I find this happens, not all the time, but more often than you’d expect: doing shit, low status work can often become a very crucial role, because you’re willing to do something that other people aren’t doing, so you create a lot of value. Again, that’s not always the case. But in my career it has worked out a few times anyway.

Eric Dodds 14:21
Okay, so you’re digging into the auction code. Did you think a lot, or go down the path of thinking, about the, I don’t want to say higher level, but the economic principles that govern it? I know you’re looking at, like, is this ad served, but an auction by nature is an economic system, right? You have supply and demand. Did you dig into that?

Josh Wills 14:49
Yes, it was funny. I lived in Austin for a few years after college, just before I came out to California for Google, and I was getting kind of bored of my job. And I was also just that sort of masochistic 20-something idiot that likes to do things that are impressive. So I started graduate school at the University of Texas at the same time. I was studying optimization theory and statistics and stuff like that, and I took a class in mechanism design, which is an economics discipline of, like, how do you design auctions and stuff like that. So I actually had an interest in this area coming in, and I was especially interested to go to Google to do auction stuff for real, for real money.

Eric Dodds 15:29
Yeah, it’s a real Yeah.

Josh Wills 15:30
So I had a decent academic background in this stuff, in the sense that I had taken a graduate level course on auction design. And then at Google, they had, of course, literal real world auction experts, people who’d done PhDs and papers and stuff like that, that I could work with and call on for their expertise. And again, they weren’t in the code, but they were the experts who could explain how things were supposed to work, and analyze stuff and all that. So anyway, I was kind of the bridge between the engineering team and the research, sort of analytics, folks. Yeah.
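
Editor’s illustration: the “position and price of an ad” that Josh mentions can be made concrete with a textbook generalized second-price (GSP) auction. This is only a sketch of the textbook mechanism, not Google’s auction code, which the conversation does not describe. The names `Bid`, `Placement`, and `run_gsp`, the `quality` score, and the `reserve` price are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Bid:
    advertiser: str
    max_cpc: float   # the most the advertiser will pay per click
    quality: float   # hypothetical quality score, assumed > 0


@dataclass(frozen=True)
class Placement:
    position: int          # 1 is the top slot
    advertiser: str
    price_per_click: float


def run_gsp(bids: List[Bid], slots: int, reserve: float = 0.01) -> List[Placement]:
    """Textbook GSP: rank by bid * quality, charge the minimum needed to hold the slot."""
    ranked = sorted(bids, key=lambda b: b.max_cpc * b.quality, reverse=True)
    placements = []
    for i, bid in enumerate(ranked[:slots]):
        if i + 1 < len(ranked):
            # Smallest price at which this bid still outranks the next one.
            runner_up = ranked[i + 1]
            price = runner_up.max_cpc * runner_up.quality / bid.quality
        else:
            # Nobody below this bid, so only the reserve price applies.
            price = reserve
        placements.append(Placement(i + 1, bid.advertiser, round(max(price, reserve), 2)))
    return placements


if __name__ == "__main__":
    demo = [Bid("A", 2.00, 0.5), Bid("B", 1.50, 0.8), Bid("C", 1.00, 0.6)]
    for placement in run_gsp(demo, slots=2):
        print(placement)
```

In this sketch the highest raw bid does not necessarily win the top slot, because ranking uses bid times quality, and each winner pays an amount set by the advertiser ranked just below it rather than its own bid.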

Eric Dodds 16:00
Fascinating. Fascinating. I mean, gosh, I bet that was really fun.

Josh Wills 16:05
It was super fun. I did it for a number of years. And, you know, honestly, I left to go do something else really just when I’d reached a point where, it’s sort of a bummer, but the limiting factor on how interesting an ad auction can be is really the advertisers. We could do so much more interesting computations. Again, I’m just gonna keep using you guys as an example. If I go to Google and do a query for RudderStack, then I’m going to see all of your competitors listed in ads at the top of the page, right? As does everyone else. If you go query Snowflake, there’s a Databricks ad, right? How much should you be willing to pay to not have those advertisers show up? This is a thing you can do. There’s a whole auction theory around this; this is a sort of product you can buy. But it’s very complicated, so complicated that you would need a PhD in economics for me to explain to you how much you should bid on this kind of thing, right? But for a computer, it’s relatively simple. Anyway, I don’t know. One of the things we did was this project building a very cool, awesome auction simulator that would let us back test different ad strategies for advertisers and all kinds of cool shit like that. But it kind of ended up failing over the fact that we couldn’t figure out a way to make it easy enough for advertisers to use it to make decisions. It kind of looked like those sort of math tests you do on the LSAT, where they show you a table or a chart and ask you questions about it. It was kind of like that. We could share data from it, but they wouldn’t be able to draw the right conclusions from it, like, okay, well, this is not actually what you want.
And it’s such a bummer, because there’s this massive power imbalance between very large advertisers, like Amazon or eBay, who have teams of people like me doing this stuff all the time and can optimize placements down to the fraction of a cent, and tiny little mom-and-pop shops on Shopify or whatever, who are just bidding $1 and have no idea, right? Anyway. Yeah, it’s a whole thing. You know, I could talk about this stuff for hours, but I’m sure there’s other stuff we could talk about too.

Eric Dodds 18:15
Super interesting. Okay, let’s jump to Slack. So, you joined Slack. How big were they when you joined?

Josh Wills 18:21
It was 240 people. 240. By my standards, that’s a big company. Yeah. Like Cloudera. Anyway, yeah. But the data team was small. There was one data engineer. And then I was hired to be that

Eric Dodds 18:37
One data engineer?

Josh Wills 18:43
He started two months, like roughly two months before I joined. It was like, yeah, when he came on board to start building the early data infrastructure. That’s right.

Eric Dodds 18:51
So yep, that’s what did they do for analytics? I mean, that’s a great question. Wild to hear.

Josh Wills 19:00
Yeah, so I think there’s a few things here, right. For a number of years, I think Slack was well known as the fastest growing enterprise software company ever. And one of the things that’s kind of underappreciated about being the fastest growing whatever is that you don’t really actually have to do anything, Eric. You can just kind of sit back, and people will keep coming and showing up and using your product. You don’t need to A/B test anything. Marketing attribution? Doesn’t matter. Right, exactly. I mean, they had a marketing stack, right. I don’t remember all the things they tried. There was Optimizely at some point, there was Marketo, there was Amplitude, all the things, right? In fact, one of my main jobs really was just throwing all that stuff out of Slack’s stack, because no one was really using it. Again, because the product just grew. You didn’t have to do anything, right? And I’m being facetious here when I say this. Obviously, I’m sure they worked very hard on the product. Slack is kind of the poster child of product-led growth. That was their idea: just make the product really good and people will just use it, right? And they A/B tested landing pages, and again, there were things here, but by and large, you did not need a massively sophisticated data infrastructure to do any of this kind of stuff, right? Anyway. So yeah, I started building it really primarily to provide the infrastructure we needed for when that stopped working, as much as anything. Once you fill your initial target market, and it’s time to start growing beyond that to the enterprise and all these other people, you do need to get serious about all this kind of stuff.
So I was building the infrastructure to make that possible, and then also the infrastructure for machine learning applications and stuff like that. That was my other major focus: making it possible for us to do, you know, search ranking and retrieval, optimization and stuff like that. Making all these things possible requires really good data infrastructure. And so that was what I set out to build. Yeah,

Eric Dodds 21:06
Yeah, absolutely. So, okay, you come in, there’s one data engineer there. Yep. How did you decide where to start? I mean, the company is growing very quickly. And of course, you’ve lived this so many times: once teams start getting data that’s useful, it creates an insatiable appetite.

Josh Wills 21:28
Totally. I mean, I did it basically wrong, Eric, in every which way. I talked about that a little bit a few years ago, where I was going through the list of mistakes I made building data at Slack. The thing I started with was really data collection. And it was data collection of the form that we had at Google. So at Google, everything is a protocol buffer. Protocol Buffers is kind of a venerable binary format. People use it for a bunch of different things these days. I used a variation of it that Facebook had created, called Thrift, at Slack, because it was kind of a better fit for our stack. At the time, the data warehouse that existed was Parquet files in S3. We were using AWS, we were using EMR. So we were running kind of a Netflix-style data architecture: spin up new clusters, and Presto, and stuff like that, all these kinds of things. And so what I said, essentially, was: going forward, the data warehouse will only accept data records that have a Thrift schema associated with them. This is it, this is the only thing that we will read. You can’t send us JSON anymore. Everything must come in with a Thrift schema. That was the rule. And I was, again, just incredibly fortunate to be at the company at a time where I could make that kind of decision and get buy-in from all the stakeholders, because there was no one, basically, to stop me. And so that was the very first thing I decided to do. And that was the only thing I did right. But it worked out, in the sense that that same kind of rule and protocol still exists today. Slack can reprocess all their data from 2015 going forward, because it’s all Thrift records; they can transform it to anything they want. And this is a very good thing. Everything else I did wrong, though.
Yeah, well, sometimes it turns out to be, you know, I used to work for a guy named Jeff Hammerbacher, and he gave me advice about management. He was like, if you hire the right people, and you motivate them properly, you incentivize them properly to do the work, you can do everything else wrong, and you’ll still be okay. And I feel like I was a very prominent example of that management philosophy in action, because I did that, but then I did everything else wrong, more or less. In terms of things I did wrong, I did not have a flagship customer. When I first got to Slack, I was doing it kind of peanut butter style: let’s talk to the growth team a little bit, let’s help out the platform team a little bit, let’s help out the performance team a little bit, let’s help out the machine learning team a little bit. So I didn’t have a flagship, you know, a big customer who was there, where I was there for them and they were there for me, and we were gonna go build this stuff, right? I was spreading stuff kind of all over the place. That was a terrible mistake, and I regret that. I insisted on building everything ourselves and using open source everything and running everything ourselves. And that was also a terrible mistake, because I liked running, you know, bleeding edge Spark versions. This is back at Spark 1.4 and stuff like that, right? And so I’m spending all this time debugging Spark issues that don’t have a Stack Overflow answer, when I should have just been, you know, writing an enormous check to Snowflake and letting them go do this for me. Things like that, right?
So yeah, fundamentally, Eric, I came in with a bunch of preconceived notions of, I am going to make Slack’s data systems look as much like Google’s as humanly possible, because I know that if all this data stuff looks like Google’s data stuff, I can be successful, and this will be good for me. And I see this all the time in early stage startup employees that come from big companies. It’s like, if you could just take this function and make it work exactly like the function worked at my previous company, I could do all this awesome stuff for you. Right? And it’s a fairly classic error, as opposed to taking the time to understand the culture, the needs of the company, all that kind of stuff. What’s right for this place, not what was right for the last place, right? I made kind of all of those mistakes. And so, yeah, I don’t hold myself up as an example. The times that I got anything right, it was just kind of a coincidence, or an accident, you know what I mean? As opposed to thinking things through from first principles like some kind of management or technical visionary. I wasn’t. I was just lucky that this one decision actually turned out to be right, because I can cite a litany of other decisions that turned out to be wrong. So yeah.
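
Editor’s illustration: the intake rule Josh describes (“the data warehouse will only accept data records that have a Thrift schema associated with them”) can be sketched as a gate that rejects anything without a registered schema. This is a minimal sketch under stated assumptions, not Slack’s system: plain Python dataclasses stand in for Thrift structs so the example runs without the Thrift compiler, and `MessageSent`, `SCHEMAS`, `SchemaError`, and `accept` are hypothetical names.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Type


@dataclass(frozen=True)
class MessageSent:
    """Hypothetical event schema; a real system would define this as a Thrift struct."""
    team_id: int
    user_id: int
    channel_id: int
    timestamp_ms: int


# Hypothetical registry: only event types listed here may enter the warehouse.
SCHEMAS: Dict[str, Type] = {"message_sent": MessageSent}


class SchemaError(ValueError):
    """Raised when a record has no registered schema or does not match it."""


def accept(event_type: str, payload: Dict[str, Any]) -> Any:
    """Return a typed, immutable record, or raise SchemaError for schemaless data."""
    schema = SCHEMAS.get(event_type)
    if schema is None:
        raise SchemaError(f"no schema registered for event type {event_type!r}")
    expected = {f.name: f.type for f in fields(schema)}
    if set(payload) != set(expected):
        raise SchemaError(f"fields {sorted(payload)} do not match {sorted(expected)}")
    for name, expected_type in expected.items():
        if not isinstance(payload[name], expected_type):
            raise SchemaError(f"field {name!r} must be {expected_type.__name__}")
    return schema(**payload)


if __name__ == "__main__":
    ok = {"team_id": 1, "user_id": 2, "channel_id": 3, "timestamp_ms": 1000}
    print(accept("message_sent", ok))
    try:
        accept("free_form_json", {"anything": "goes"})
    except SchemaError as err:
        print("rejected:", err)
```

The property this is meant to show is the one Josh credits for the rule’s longevity: because every stored record is tied to a known schema, old data can later be re-read and transformed into other formats.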

Eric Dodds 25:56
Anyway, that dynamic, you know, I’ve heard it described as sort of like fighting the last war. Yeah, that’s a good analogy. Absolutely. You mentioned a few things that fall into that category in terms of the technology side. Did you also bring any cultural or management elements of that from Google?

Josh Wills 26:20
Yeah, I did. I think the management one, and I ran into this with a few other Googlers who came to Slack, is that Google is very big into technical leadership and management being in the same person: the same human is both the technical leader and the manager for a group or a function or something like that. Engineering directors at Google are expected to be very technical, all that kind of stuff. And at Slack that’s deeply not the case, or at least was not the case when I was there. Technical leadership was a separate function, a separate person, from human management, the managerial aspects of things. And that was a culture shock for me, because I did not operate that way. And so I really just did not naturally fit into Slack’s management culture, which didn’t want to operate that way. I think it’s interesting. Again, like most things, with time and perspective I’ve come to appreciate the virtues of both systems. Google’s system is good from a communications perspective, in that all of the information that’s critical in the interactions between technical decisions and management decisions is all in one person’s head. That saves a lot of time and, you know, means fewer meetings and less overhead. The downside, and kind of the most pernicious one, is that the Google approach is kind of ripe for abuse. If you work for one of these tech lead managers, there is essentially one person who has a tremendous amount of control over your career, in terms of what you do and how you do it. And that’s not good, you know. And this has obviously been dealt with over the years at Google, but there was a lot more questionable behavior, I would say, from certain technical managers at Google.
At Slack, that was just absolutely not tolerated at all. If anything like that happened, you were just gone, you were gone that day. And that was, in many ways, much better. I think it’s kind of a checks and balances system to prevent abuses of power. Again, no place is perfect by any stretch, but it was significantly better, I would say. So.

Eric Dodds 28:33
Yeah, that makes sense. That stuff, yeah. Combinations of authority, like, you know, people sort of will naturally act out of self-interest. And so when you combine, like, different modes, exactly, different realms of authority, like, you know, the chances of abuse are greater.

Josh Wills 28:49
Absolutely, exactly. Again, the good is better, and the bad is much worse. Again, it’s just, like, totally choosing what is right and what makes sense for your organization. I think, you know, this is kind of what strategy and company culture and all this kind of stuff is about, right? It’s like, if you want to do something, and everybody wants to do it, like, let’s ship high-quality software. Yeah, obviously, great. Everyone wants to ship high-quality software. That’s not interesting. That doesn’t say anything about you. But, like, what are you willing to give up to do that? Are you willing to hold timelines indefinitely? Like, what are you giving up? If you’re not giving anything up, it doesn’t mean anything. Right? Yep. So yeah, I think, again, in my old age, with my experience and stuff, it’s about being able to see those trade-offs and kind of understand that stuff. One thing I’ve always wanted, and I never got to ask a CEO this, I guess: I would love to find a CEO of a company who could honestly tell me what the most important function of their company was. I’ve never actually seen one honestly tell me that. Like, I can ask anyone who works there, and they can clearly tell me what the most important function is: we are an engineering-centric company, we are a sales-centric company, we are a design-centric company. It’s unambiguous to everyone who works there. But a CEO that would just say it would be so refreshing and interesting.

Eric Dodds 30:03
But I love my children the same. Of course,

Josh Wills 30:07
Everything’s important. They’re old ones. Yeah, but it’s not though. It’s just deeply not true, right? It’s never true, right? Yeah. Anyway, Yeah,

Eric Dodds 30:14
That’s, uh, yeah, it’s interesting. I mean, feature requests, right, are generally a great proxy for, like, what’s actually important to a company. Like, exactly, tail wagging the dog? Exactly. In one way or another. I’ve seen that over and over.

Josh Wills 30:30
100% Totally.

Eric Dodds 30:33
Okay, you actually changed roles at Slack, from doing data engineering to search. And I want to dig into that, because that’s super interesting. But before we jump into search, yeah, one thing that’s really interesting to me about companies like Slack is that you are sort of a daily user of the product that you’re building infrastructure for. Yeah, we had Erik Bernhardsson on, who was at Spotify really early on, and he talked about this visceral experience of building, like, okay, we’re all using Spotify for, like, 12 hours a day while we’re building it. Yep. What was that dynamic like at Slack?

Josh Wills 31:24
Oh, that’s a great question. Man, no one has really ever asked me that before. It was amazing, in almost every way. I guess, trying to think of the things I would say here: Slack used Slack for everything. Everything. Every process at a company that would normally be handled by Workday, or email, or documents was done via Slack, via Slack integrations. All of our data analytics, like, we built our own kind of in-house dashboarding tool, like Mode or something like that, but it had very deep Slack integrations, so that you would interact with charts and visualizations in Slack. And, you know, again, we basically backdoored all this stuff into the Slack product, because we were Slack and we could do that. Right? Yeah. So yeah, everything was done in Slack. I used to do it, and I kind of miss it, actually. I used to analyze Slack usage at different companies to kind of understand, you know, it was hard, Slack users didn’t really churn. And so churn was always kind of tricky for us, but we were trying to understand good Slack usage versus bad Slack usage. There are people who have Slack that use it essentially just for DMs. Like, they don’t use channels and stuff, right? Yep. And I consider that bad Slack usage. Because, like, what’s the difference between using Slack and using any other DM client, right? Slack as a company, as a user of its own product, was kind of off by itself. In terms of the number of channels consumed per user per day, the number of times people visit a channel and write messages, that stuff was just off the charts. The joke at Slack was always that Slack the product was always optimized for whatever size company Slack was at the time.
You know, so when Slack was 10 people, Slack was best for a 10-person team. When Slack was 240 people, it was best for a 240-person team, and so on and so forth, kind of up the stack. And the hard part, and I don’t want to pick on anybody here, but dbt has gone through this. dbt is one of my angel investments, and I do a lot of dbt stuff, and so I know a lot about dbt. dbt has gone through this hypergrowth thing that Slack went through as well, in the data space, and they’re kind of the poster child for this stuff, right? And Pedram, Pedram was on this show a little while ago, I think. He wrote the very famous “We need to talk about dbt” blog post, which was really good. And he said a lot of things that were deeply true. I think what’s hard for folks who aren’t at these hypergrowth companies to understand is how all the wheels come off of everything, all the time, when you’re trying to grow. And especially when you’re trying to grow into very large enterprises. Yeah. And so for Slack, for us, that was really when IBM adopted Slack. And we went from the largest team being like 8,000 people to the largest team being like 100,000 people. Like, dude, everything broke. Everything broke for a year, every single thing, right? All we were doing was just fighting fires and making the thing work for IBM. There was honestly no engineering capacity for anything else, right? Yeah. Again, it’s definitely one customer that is a flagship, but it’s all your processes. It’s everything. Everything is constantly breaking all the time. And it’s hard to see that externally. All you see externally is like, wow, Slack is not really shipping any features, and blah, blah, blah.
It’s like, we’re actually shipping an enormous amount of features, but there’s nothing you can see, because you’re not on a 100,000-person team. Yeah. Yeah. And that’s, it’s just incredibly hard and exhausting. And I feel like at this point in my career, I don’t want to do that again. Yeah. I mean, like, I’ve had that experience. I have a lot of friends who are at odds and Airtable and, like, other companies like this, and I’m just like, yeah, you know, y’all, I’m so happy for you guys. It’s great. But I’ve done that. And I don’t feel the need to do that again. I’m okay with just, you know, pretty regular growth. Yeah. I don’t need to experience that again. I got my taste. I’m fine.

Eric Dodds 35:43
Yeah. Yeah, that makes sense. I’ve heard it described as sort of appreciating the physics of the system. Right. Yeah. Rate, you know, causes chaos. Yes. Destruction, you know. And I would imagine at Slack, you know, maybe if you think about infrastructure where there’s not a lot of, you know, user interface elements, right, it’s like, okay, well, that’s a little bit different. But I would imagine with Slack, the physics are actually catching everything on fire, from the UI down to what you are doing on the data collection side, because it’s like, okay, well, IBM needs this thing. It makes sense. Well, how do we prioritize that against all of these other feature requests and PRDs and everything that, you know, everyone else has? Totally. And culturally, you have great product managers who are like, I just got deprioritized. I mean, you know, I can’t imagine. Yeah.

Josh Wills 36:47
You don’t want it, Eric. It’s not, I mean, it’s not like, anyway. Yeah. Yeah. It’s not fun, I’d say. Yeah, exactly.

Eric Dodds 36:59
I mean, it’s fascinating. It’s absolutely fascinating. Okay, so you got interested in search, and you went to work on search at Slack. Can you describe that transition, and how that came about?

Josh Wills 37:13
Yeah. So when we were talking about all the things that were breaking in Slack, search was actually pretty high on the list of things that were breaking, kind of circa, like, really late 2016, I think, and then, like, 2017. So the original Slack search stack was built on Solr, like Solr 4, you know, open source search technology and stuff like that, a much older version of it. And we needed to do a whole bunch of things to upgrade it. And the data engineering part of the problem was that we needed a way to build a sort of historical search index for Slack that was sort of optimized. So how do I explain this? Um, there’s two kinds of search problems people have, broadly speaking. There is write-intensive search, which is like Elasticsearch, like logs ingestion, or Splunk, or something that’s write-intensive. So for write-intensive search, writes and updates to the index outweigh queries by like a factor of 100. And then the rest of search, like e-commerce search or web search, is read-intensive search, right? So writes are relatively rare, and everything is optimized for reads. Slack, amusingly, is kind of both in some ways, in that unlike Elasticsearch, where you keep maybe a week or two of logs, right, at Slack you actually keep everything forever, right? So messages are getting written at a super fast rate, like 100x, right? However, typically speaking, about 99% of messages never change: within like 10 minutes of being written, there will be no changes, no modifications to them ever. Right? No edits. So you can treat the historical index as a read-only index, and then have a real-time index for just the stuff that’s happening right now. So you need to do both, and then kind of unify them together. Fast.
And so the problem of how you take the existing write index and restructure it to become a read-optimized index, and just reorganize the data and sort of build everything for reads, is essentially a data engineering problem. And it’s like the data engineering problem that Google had to solve to build web search. And so the data engineering team I was managing was working on this problem when I was managing it. And again, because I’m super technical and basically can never tear my fingers off the keyboard when it comes to programming, when I left management, I basically pushed the engineers who were working on the problem out of the way and took over the problem from them, because I wanted to do it so badly. This is again, you know, more examples of me being not a great manager, director type person, but nonetheless. So yeah, that was the gigantic data pipeline problem that I had to solve: how do I take all this data and restructure it into a read-optimized format to fix a bunch of performance issues we were having? Performance was the big problem. The P95 query latency of Slack search was like five seconds or something like that back in 2017. Like, you type a query, and we take five seconds to get a response, which is basically, essentially, infinity, for all intents and purposes, right?
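
The split Josh describes here, a frozen historical index plus a small real-time index that are queried together, can be illustrated with a toy example. This is only a minimal sketch under stated assumptions: the class names, the whitespace tokenizer, and the 10-minute "settle" threshold are hypothetical and are not Slack's actual design, which was built on Solr.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    msg_id: int
    ts: float  # seconds since epoch
    text: str


def tokenize(text):
    # Naive whitespace tokenizer; a real system would use a proper analyzer.
    return text.lower().split()


class HistoricalIndex:
    """Read-optimized: built once from settled messages, never mutated."""

    def __init__(self, messages):
        postings = defaultdict(list)
        for m in sorted(messages, key=lambda m: m.msg_id):
            for tok in set(tokenize(m.text)):
                postings[tok].append(m.msg_id)
        # Freeze posting lists so nothing can write to them after the build.
        self._postings = {tok: tuple(ids) for tok, ids in postings.items()}
        self._docs = {m.msg_id: m for m in messages}

    def search(self, term):
        return [self._docs[i] for i in self._postings.get(term.lower(), ())]


class RealtimeIndex:
    """Write-optimized: small and mutable, holds only recent messages."""

    def __init__(self):
        self._docs = {}

    def upsert(self, message):
        # New messages and edits both land here.
        self._docs[message.msg_id] = message

    def search(self, term):
        term = term.lower()
        return [m for m in self._docs.values() if term in tokenize(m.text)]

    def drain_settled(self, now, settle_seconds):
        """Remove and return messages old enough to be treated as immutable."""
        settled = [m for m in self._docs.values() if now - m.ts >= settle_seconds]
        for m in settled:
            del self._docs[m.msg_id]
        return settled


def unified_search(term, historical, realtime):
    """Query both tiers and merge, newest first."""
    merged = {m.msg_id: m for m in historical.search(term)}
    merged.update({m.msg_id: m for m in realtime.search(term)})
    return sorted(merged.values(), key=lambda m: m.ts, reverse=True)
```

In this sketch, a batch job would periodically call `drain_settled` and rebuild (or append to) the historical index from the returned messages, which is the "reorganize the data for reads" pipeline step in miniature.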

Eric Dodds 40:20
I actually remember this, yeah. It would degrade pretty significantly once your space got

Josh Wills 40:28
Yes. Exactly, exactly. Because it really was that all of your data was being served off of a single server that was, you know, co-hosted with a bunch of other teams. It was a multi-tenant kind of instance, with essentially no sharding whatsoever. Like, replication, but no sharding. And so if you had a noisy neighbor, you were basically screwed. Your search just wouldn’t work, because someone else was so dominant. And again, there was no dedicated sort of write-centric infrastructure and read-centric infrastructure; it was all the same thing. And so, you know, on the write side, you’re just taking data as fast as you can and making it searchable. Sure, but doing that is not a great strategy for making it fast, right? So anyway, all this kind of stuff was what we built, what we fixed, and it was great. It was so hard. I loved it.

Eric Dodds 41:15
I can’t imagine. I mean, it is interesting, I hadn’t thought about how that’s a multi-dimensional problem, right? Because, of course. Why would you have a one-dimensional search problem in this business? Right? That’s right. Okay, so can we talk a little, you mentioned something really interesting. So let’s compare, like, e-commerce to Slack. So in e-commerce, inventory is changing. I mean, search in many ways is almost ephemeral, unless you think about durable patterns around, like, a category or a color or something like that. With Slack, you have basically an exponentially growing set of unique logs that serve as a very important historical archive for a company and that, in many cases, becomes a reference for the work that people do every day.

Josh Wills 42:16
Yeah, that is,

Eric Dodds 42:17
When you studied sort of the usage patterns of search, how did you think about this? Maybe I’ll direct the question a little bit. How did you think about the problem in terms of, like, I need to find that file, versus, okay, there’s this sort of giant log of historical information that provides a lot of context? Like, even those are very different search problems.

Josh Wills 42:42
That is deeply true, like, absolutely. So when I was working on search at Slack, again, we were really just mostly focused on this kind of performance, latency kind of problem. So we were doing fairly minimal relevance optimization, other than just kind of keeping the system from, you know, catching on fire and burning down. So this would be very classic search relevance, like BM25 ranking with a time decay factor and stuff like that. Basically, we weren’t doing anything special; we were doing something any, you know, out-of-the-box Solr person could do. At the same time as we were, you know, fixing search, we started hiring a relevance engineering team, right, to start doing a much richer and deeper set of analyses and to build a dedicated ranking service that would personalize search for you based on, like, who do you interact with? What channels do you interact with historically, right, that kind of stuff. Introducing embeddings, like, sort of vector search techniques to rescore and, like, understand what a message is about, broadly speaking, is also a very important kind of problem. Files are much easier than messages in Slack. So files: lots of context, lots of information. Messages: little tiny snippets. So the kind of search document in Slack isn’t just a single message. It’s really a kind of collection of timed messages that occur in the same relatively narrow window that is used for ranking and relevance and stuff like that. Again, it was very crude when I was there. It was just, like, grab the most recent N messages within some window of time, and then call that the document. My understanding is it’s become much more sophisticated over time, again, using embeddings to say, yeah, these messages are about the same issue and like Docker stuff, right? They’ve gotten much better at it over time.
The big thing to understand about search is, it’s kind of like what you said: a little tiny startup that has 10 people doesn’t really use Slack search, because you can just ask somebody, right? You know, you can DM everyone in the company. And then on the flip side, IBM uses the hell out of search, because, yeah, exactly, you just can’t do that at IBM. It’s just not an option, right? So people use the hell out of search. So something like three fourths of searches are done by the largest of the large companies on Slack. They just sort of dominate the usage and stuff like that. And then that transitions over time; you can essentially see it as an organization grows. I’m trying to think where the knee in the graph is, but it basically goes exponential, right? It’s like, searches are very low with 10 people. And then at like 100 people it’s marginal, but then you hit like 10,000, and the number of searches they do is going straight up, basically. So again, it’s a relatively small fraction of Slack users. But it’s also, you know, the most valuable users. It’s the largest enterprise deals and stuff that are doing the searches.
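
The two ideas in this passage, treating a short run of messages as one search document and ranking with BM25 plus a time decay factor, can be sketched together. This is a hedged illustration, not Slack's implementation: the window size, the cap on messages per document, the exponential half-life decay, and the BM25 parameters (`k1=1.2`, `b=0.75`, with a Lucene-style smoothed IDF) are all assumptions chosen for the example.

```python
import math
from collections import Counter


def group_into_documents(messages, window_seconds, max_messages=10):
    """Group one channel's (timestamp, text) messages into search documents.

    A document is a run of up to max_messages messages that all fall within
    window_seconds of the first message in the run (a crude assumption).
    Returns a list of (latest_timestamp, joined_text).
    """
    docs, current, start_ts = [], [], None
    for ts, text in sorted(messages):
        too_late = current and ts - start_ts > window_seconds
        if current and (too_late or len(current) >= max_messages):
            docs.append((current[-1][0], " ".join(t for _, t in current)))
            current = []
        if not current:
            start_ts = ts
        current.append((ts, text))
    if current:
        docs.append((current[-1][0], " ".join(t for _, t in current)))
    return docs


def bm25_scores(query, texts, k1=1.2, b=0.75):
    """Plain Okapi BM25 over whitespace tokens, one score per text."""
    tokenized = [t.lower().split() for t in texts]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rank(query, documents, now, half_life_seconds):
    """Rank (timestamp, text) documents by BM25 times an exponential time decay."""
    base = bm25_scores(query, [text for _, text in documents])
    scored = [
        (text, s * 0.5 ** ((now - ts) / half_life_seconds))
        for s, (ts, text) in zip(base, documents)
        if s > 0
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With a short half-life, a recent document that mentions the query term once can outrank an older document that mentions it twice, which is the practical effect of the time decay factor.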

Eric Dodds 45:53
Yeah, it makes total sense. Now, the context is really difficult. Like, it makes files far, much easier. Because from a ranking perspective, yes. You know, it’s like, Okay, someone, you know, someone in authority above me approved this in either a channel or a DM. And I remember they said, that sounds fine. And like, I’m trying to find that right.

Josh Wills 46:21
Yeah. Yeah, totally. You know, that sort of context is super difficult. Absolutely. Completely agree. Yeah.

Eric Dodds 46:25
Okay, I saved the best for last. You know, we could keep going. But let’s talk about DuckDB. So okay,

Josh Wills 46:32
if you want to. I don’t talk about DuckDB very much anywhere, so this is a nice change for me.

Eric Dodds 46:38
Well, it’s super, it’s really interesting to me.

Josh Wills 46:45
Just kidding, I talk about DuckDB all the time, like literally all day. It’s all I do all day. I’m the reply guy on Twitter, like, saying, have you tried using DuckDB for that?

Eric Dodds 46:56
What’s interesting is you’ve been doing that for a long time, when it wasn’t cool.

Josh Wills 47:02
It’s still not cool. Eric, it’s definitely.

Eric Dodds 47:07
I think it’s cool. I think it’s... Thank you. Why did it become cool? You know, sort of in the recent past. Like, it’s been around for a while. Yep. And you’ve been talking about it for some time, and you’ve built, you know, dbt things. And so, you’ve been working on this for a while. Why do you think it’s become popular?

Josh Wills 47:31
So I, it’s a good question. It’s a good question. There’s a lot of things, Eric, and I don’t personally use RudderStack, so I can’t comment on this, right? But there are a lot of tools, a lot of software tools, that are kind of like experiential goods, is how I describe it. Say, how do I sell you, how do I market to you, why you should go skydiving? Skydiving is an experiential good, right? You either do it and it’s life changing, or it’s not, right, that kind of thing, where it’s terrifying and you die, whatever, right? Yeah. This is true of so many software tools. Slack is an experiential good, right? I mean, again, if you described Slack to me in 2013, it’s like, hey, we’re gonna take IRC, this thing from the 70s that nerds use, and we’re gonna make it kind of pretty, and we’re gonna put it on your phone, I’d be like, you know, great. Good luck with that, you know, call me, tell me how that works out. Right? Yeah. Chrome was this experience for me. Like, you remember downloading Chrome for the first time? And it was, yeah, faster, just consistently faster, right? That kind of stuff, way less bloated than it is today. Yeah, exactly. You know, again, talking about dbt, it was this experience for a lot of people. It’s an experiential good; you have to try it. There’s an observability tool I love called Honeycomb that is absolutely fantastic for doing kind of traces and sort of deep analysis of requests and stuff. But again, it’s an experiential good; you have to try it to really understand how much better it is. And so, again, I think tools like this have these kind of long, you know, flat curves, and then things go exponential as more and more people try it. And then the more you hear about people trying it, you want to try it yourself. And it turns out that it does live up to the hype. Chrome lives up to the hype, that kind of stuff.
Like, it’s these overnight successes that are years and years in the making, that kind of stuff. I think, you know, what makes DuckDB so great to me is really just very solid, kind of, engineering fundamentals. I don’t wanna say there’s nothing innovative about it. There’s tons of innovative stuff about it, but from, like, database architecture principles, it’s a vectorized volcano model. There’s no just-in-time compilation using, like, you know, LLVM or anything like that, right? There’s none of this other sort of fancy stuff. Like, Facebook has this thing called Velox now, right? And it’s super cool. Fancy hardware, custom compilation, like, there’s some cool computer science stuff in there, right? But you can’t just download it and use it. Yeah, it’s not that kind of party, right? Yeah. DuckDB is just, like, you can just download it and use it with your regular boring-ass CSV, Parquet, JSON files, and it will just make your life better. And that’s it. It’s just exactly what you’re doing, just better. You don’t have to run a server, you don’t have to upload any data. And, you know, first of all, I would just say, not only are Hannes and Mark, kind of the co-founders of the project, DuckDB Labs, and CWI in the Netherlands before that, not only are they fantastic, exceptionally skilled database people, and I think you know, as I do, that database people are, like, the best. They are the best software engineers. It is the highest, they are off, like five results. 100%, right. They are also just the nicest people in the world and were utterly kind and welcoming to me. I just started hacking on their stuff back in 2020. I was like, hey, I want to do this weird thing with dbt, and I need some stuff, I need to go make some changes.
They were super nice and welcoming to me, who had not written C++ in anger in, like, I don’t know, 10 years or so at that point. They made sort of a community and a product and an experience that was super open, super welcoming, with very strong kind of “yes, and” kind of vibes and stuff. And there are a lot of software cultures that are very, like, “no, fuck you” kind of things. A lot of them. Sure. And so it’s just the combination of technical expertise with just kindness and community building. And then, I mean, man, I could just prattle on about how great DuckDB is all the time. But the thing that just becomes so cool about it is how they’re really just free to sit back and try new cool ideas. Not that they don’t worry about compatibility. They do, like, their test suite is extensive and phenomenal and stuff, but they are free to try new things and just see what works, in a way that just doesn’t feel like it’s true of a lot of data vendors and stuff right now, I guess. And I’m not totally sure why that’s true. Like, why are they still free, and other data vendors aren’t? But they are, and it’s just incredibly refreshing, I think. Anyway, yeah. Yeah, that’s out there anyway, yeah. Well,

Eric Dodds 52:30
You’ve actually used so many data tools, right? You know, it’s like you said: okay, we’re gonna build these things ourselves, we’re gonna use all these open source tools, we’re gonna run, you know, these really early releases of Spark and cause all these problems, etc. So your opinion carries so much weight. But I want to dig into that intangible just a little bit more, right? And I’m going to use an analogy. So, like, my dad’s a mechanic, and he has, you know, a set of really nice tools. And it’s like, okay, you know, it’s a ratchet, right? And it’s like, well, okay, a ratchet is a ratchet. Right? And it’s like, yeah, but, you know, he has this really amazing ratchet. Right? And it’s like, going, is it just his, right? And so, okay, when he lets me use it, it’s like, oh, wow. It’s so precise. It’s hard to describe. Like, okay, it’s a ratchet, you know? Yeah. Like you said, it’s an experiential good. That’s interesting, because it’s very difficult to communicate about that without trying it. I mean, you’re an advocate for it, obviously. But that’s a fascinating dynamic to me, where there’s almost like an ergonomic element to it that’s difficult to describe in words.

Josh Wills 53:48
I completely agree. I mean, I think the ratchet analogy is an interesting one. And I think where DuckDB benefits is from how easy it is to get going. It really is, like, pip install duckdb. Yep. import duckdb. duckdb.connect. Query stuff, query CSVs, right? It asks sort of so little of you to get going, compared to almost any other tool. Like, even other open source tools. There’s got to be a special circle of hell reserved for these sort of, quote, open source tools that make you sign up and create an account in the cloud before you’re allowed to use them. And again, apologies if RudderStack does that. I hope you don’t. I have some tools that do this in mind. I hate them. But, yeah, it asks so little of you. Yep. In order to just get rolling with it, and try it out and see if it works and stuff, right. And again, I just think that’s super, super key for this kind of stuff.
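
For readers who have not tried it, the getting-started flow Josh lists (install the package, import it, connect, query a CSV) looks roughly like the sketch below. The file name, the column names, and the two helper functions are hypothetical and exist only for illustration; `duckdb.connect()` with no arguments opens an in-memory database, and `read_csv_auto` is DuckDB's CSV-reading table function.

```python
import csv
from pathlib import Path


def write_sample_csv(path):
    """Write a tiny, made-up CSV of messages so there is something to query."""
    rows = [
        ("general", "alice", "hello"),
        ("general", "bob", "hi there"),
        ("data", "alice", "pipeline is green"),
    ]
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["channel", "author", "text"])
        writer.writerows(rows)
    return Path(path)


def top_channels(csv_path):
    """Count messages per channel by querying the CSV file in place."""
    import duckdb  # installed with: pip install duckdb

    con = duckdb.connect()  # in-memory database: no server, no data upload
    # The path is interpolated directly, which is fine for a trusted local
    # path without quote characters; do not do this with untrusted input.
    query = f"""
        SELECT channel, count(*) AS n
        FROM read_csv_auto('{Path(csv_path).as_posix()}')
        GROUP BY channel
        ORDER BY n DESC, channel
    """
    return con.execute(query).fetchall()
```

Calling `top_channels(write_sample_csv("messages.csv"))` returns one `(channel, count)` tuple per channel, with the busiest channel first.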

Eric Dodds 54:50
It’s almost like those little friction points are just ironed out of the way, but it’s actually hard to describe, right? It feels like flow, right? It’s just right.

Josh Wills 55:03
It’s true. That’s right, exactly. And then, it’s just like you said, you feel it in your hands, like the ratchet, the way it feels in your hand. The analogy I would use would be a good carving knife. Like, if you’ve ever felt the difference, you know, carving steak or whatever, with a really good, yeah, a really good knife. Yeah, yes. Versus kind of a mediocre knife. It’s, like, a quality. Again, it looks like a knife. You can’t tell. Yeah. And you hold it, and the weight is just better. Yeah, I mean, that’s exactly it. Yeah.

Eric Dodds 55:33
It’s like, I’m not working as hard, and I can have more precision. But that’s, yeah, it’s interesting. Fascinating. Well, Josh, we are at the buzzer. This has been amazing. You have to come back on with Kostas, because I feel like we got through maybe, like, 15% of

Josh Wills 55:56
what I wrote down. Gotcha. I would be happy to.

Eric Dodds 56:00
Thanks very much for giving us your time. My pleasure, Eric, thanks

Josh Wills 56:03
so much for having me. I appreciate it. And like I said, we’d be happy to come back on anytime. Always good to talk shop. Okay,

Eric Dodds 56:08
let’s do it. What a fascinating conversation with Josh Wills. I think one of the biggest takeaways from his time at IBM, Google, Slack, and other companies is that he has this tenacious curiosity that compels him to build his data projects by actually writing code in areas where he doesn’t have any authority or, as he says, sort of any skill, which is really interesting. You know, he talked about going in at Google and actually updating the code base for Google’s ad auction for search in order to get the right data to do data analysis, and setting up data engineering pipelines there. Which, you know, that’s very intrepid in and of itself. But even with DuckDB, which is what he’s been working on recently, you know, he said he hadn’t written C++ in 10 years, right? And he’s submitting pull requests and collaborating with the people who started DuckDB, because he just has this intrepid curiosity. So, a really incredible episode, with so much to learn about tool adoption, growing a data team, learning from your mistakes, and curiosity. But I think the big takeaway is that, you know, his intrepid curiosity is really what sort of led him to all of his success. So, great episode. Subscribe if you haven’t, tell a friend, and we will catch you on the next one. We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.