Episode 79:

All About Experimentation with Che Sharma of Eppo

March 16, 2022

This week on The Data Stack Show, Eric and Kostas chat with Che Sharma, the founder and CEO of Eppo. During the episode, Che discusses important lessons, seeing data in context, testing frameworks, and much more.

Notes:

Highlights from this week’s conversation include:

  • Che’s background and career journey (4:23)
  • Coherence between hemispheres in the human brain (6:58)
  • Raising Airbnb above primitive A/B testing technology (8:54)
  • Economic thinking in Airbnb’s data science practice (14:24)
  • Dealing with multiple pipelines (16:48)
  • Eppo’s role in recognizing statistically significant data (20:01)
  • Defining “experiment” (23:25)
  • Types of experiments (25:57)
  • The workflow journey (27:18)
  • Dealing with metric silos (34:21)
  • Why we still need to innovate today (37:03)
  • Where experimentation can be used (39:36)
  • How big a sample size should be (43:29)
  • How to self-educate to get the maximum value (45:39)
  • Bridging the gap between data engineers and data scientists (48:14)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 0:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com. And don’t forget, we’re hiring for all sorts of roles.

You have the chance to meet Kostas and I live, in-person coming up soon in Austin, Texas. We’re both going to be at Data Council Austin. The event is the 23rd and 24th of March, but both of our companies are hosting a happy hour on the 22nd, the night before the event, so you can come out and have a drink with Kostas and I. Kostas, why should our visitors join us live in Austin?

Kostas Pardalis 0:54
Tequila, of course.

Eric Dodds 0:57
That could make things very interesting.

Kostas Pardalis 1:01
I mean, yeah. It’s a happy hour. People should gather before the main event, so without being tired from the event or anything, come over there, meet in person, something that we all miss because of all this mess with COVID. Have some fun, talk about what we are doing. And yeah, relax and have fun.

Eric Dodds 1:24
It’s gonna be a great time, learn more at datastackshow.com. There’ll be a banner there you can click on to register for the happy hour and we will see you in Austin in March.

Welcome to The Data Stack Show. Today we are going to talk with Che from Eppo, which is an experimentation platform that actually runs on your warehouse, which is super interesting. Can’t wait to hear about that, but he has done some unbelievable things. Most notably, he built out a lot of the experimentation framework and technology at Airbnb pretty early on, and he was there for four or five years. I’m really interested to ask him about when he started at Airbnb, because the experimentation frameworks or experimentation products as we know them now were nowhere near this level of sophistication, right? So they probably had to build a bunch of stuff inside of Airbnb. And I think one of the side consequences of that is, when you think about testing, you’re sort of creating an additional data source that ultimately needs to be layered in with, say, product data, or marketing data, or whatever. And that’s a pretty complex technical challenge. So I’m going to ask him how he sorted through that at Airbnb. How about you?

Kostas Pardalis 2:44
I’d like to get a little bit more into the fundamentals of experimentation, mainly because I think many people, especially in product management, take experimentation very lightly. They think it’s just a tool that tells you what you should be doing, which obviously is not the case. And especially if you are in a company that doesn’t have anyone with a background in statistics, or a data scientist, or even an analyst, it can be very hard for someone to interpret things correctly and make sure that the experiments they’re running are done right. So I’d love to learn more about that: what is an experiment? How should we run them? What’s the theory behind them? Why should we trust them? All that stuff. And also see if we can get some resources from Che about how we can learn more. For product managers especially, I think it’s important to get educated before we start using these tools.

Eric Dodds 3:59
I agree. Well, let’s jump in and see what he has to say.

Che, welcome to The Data Stack Show. We’re so excited to chat with you.

Che Sharma 4:09
Yeah, I’m excited to chat with you too.

Eric Dodds 4:11
Okay, well, so much to cover today. You’ve done some amazing things at companies like Airbnb and Webflow, but just give us your brief background and what you’re doing today.

Che Sharma 4:21
Yeah, absolutely. So my name is Che. I’m the founder and CEO of Eppo, which is a next-gen A/B experimentation platform, one that’s specifically focused on the analytics and reporting side of the workflow. A lot of it comes from my experience building these systems at, like you said, Airbnb, Webflow, and a bunch of other places. My background is in data science. I was the fourth data scientist at Airbnb. I stayed there for about five years, from 2012 to 2017, which has really guided a lot of the way I think about the data ecosystem, and seeing how it played out at Webflow sort of validated a lot of those things. So some concrete things to know about Airbnb: it was founded by designers. And that kind of bleeds into the way the company likes to do work, which comes much more from a Steve Jobs approach, incubate something, focus on UX, and then release it with grand announcements, more than the sort of iterative, data-driven, metrics-based approach of a Zuckerberg or Bezos type of company. And what that really means as a data team is that in addition to needing to build all of these capabilities, you also have to win over the org to a certain way of thinking, to believing metrics matter. So, in addition to building infrastructure, we had to essentially win a culture war. And it was really interesting to see how that all played out. To me, the biggest piece of solving that problem was experimentation, because experimentation as a function just unlocked so much. In addition to the concrete ROI of measuring things, it also fundamentally changes the way you do work, where suddenly people have this really intimate connection with metrics, where you understand exactly what you drove and what you didn’t drive, and how your work played into it. It also unlocks this culture of entrepreneurialism, where people can take risks, try stuff out, and validate ideas without having to win political battles. So this combination of incredible business ROI, plus changing your corporate culture to one that’s a little bit more fun, was really what led me to start Eppo.

Eric Dodds 6:27
Very cool, super interesting. I have to ask one thing, and I apologize for the surprise here because we didn’t talk about this in show prep, but you did some research on the human brain and I read that it was on coherence between hemispheres. Because we are all super nerds on this show, can you just tell us a brief bit about that? Like, what were you studying between hemispheres? That’s so interesting.

Che Sharma 6:54
Yeah, it was fun. So a bit of background: in university I studied electrical engineering with a focus on signal processing, and then I studied statistics. And I really thought it was a cool way of understanding information and being able to make statements about it. I came across this researcher in the Stanford psychology department who was trying to see if there was a different way of understanding the brain, where instead of just looking at an MRI and seeing what lights up, just seeing where the blood flows, you said, maybe the way the brain works is not increased blood flow, it’s from synchronizing things. So there are two parts that are just kind of doing their own things, whatever, but when they are focused, they just lock, and suddenly they’re communicating. I didn’t know anything about the neuroscience; my background was in detection theory. One of the great things about being a statistician is getting to play in everyone’s backyard and understand their fields, and so this was my way of learning a little bit about brain research. So it was really about tough statistical methods: how do you make a hypothesis test around the synchronization of neurons? But yeah, it’s very cool. I was only working on it for about six months, so I can’t quite tell how that research evolved over time, but it was cool to learn about the field.

Eric Dodds 8:16
Fascinating. We’ll have to dig into that more. Let’s dig into data stuff. We have tons of questions about the technical side, but one thing I want to ask to start out: let’s go back to Airbnb. When you joined, and I’m just trying to go back in my mental history here, dusting off some pages, back then the A/B testing technology was pretty primitive compared to what it is today. How did you begin to think about that? Did you evaluate vendors? Did you just start out building it yourself?

Che Sharma 8:54
Yeah, at Airbnb, we always had a bias to build over buy. I think you can kind of see that in the number of open-source projects out there. One of my colleagues at Airbnb used to work at Snowflake, and the Snowflake folks have forever been like, why do you like spending so much time and energy rolling your own in this day and age? So in any case, Airbnb has always had its own biases. We kind of knew from the start we were going to build our own experimentation framework. One engineer built a feature-flagging thing fairly quickly; that didn’t take too much time. But then one data scientist and an engineer decided they wanted to build the kind of end-to-end workflow and UI. At Airbnb, the first team to run experiments was the search team. This was back in, I think, late 2012, 2013. And this is pretty typical: at most companies, I think the first team to run an experiment is either a machine learning team or a growth team, and here it was a machine learning team. Every time you iterate on a model, you want to see if you drove revenue, not just a click, but did you drive revenue? And crucially, if you’re a machine learning team, you need that evidence if you’re going to hire more engineers. So that’s kind of the early phase of investment. Once that team started showing more success, other teams started adopting it, the growth team, the rest of the marketplace teams. And we quickly started seeing how the teams that adopted experimentation wholesale, really deeply, started showing their success. One of the really formative experiences for me was this search team basically re-accelerating growth for us. Most companies start on this crazy rocket ship where they grow 3x, 3x, 3x, and then it’s like 2.7x, 2.5x, or whatever. This team was able to re-accelerate that, to break the logarithmic curve. And it was clear it was this team, because they ran experiments and proved it, right? It was a really amazing coalition-building moment. So I always say, if you’re going to spread experimentation culture, start with the kinds of teams that are going to adopt it really readily, and don’t try to push it on everyone else until you’ve shown success.

Eric Dodds 11:15
Hmm. Interesting. Do you remember one example of a test that that team ran that was kind of like, woah, this is huge?

Che Sharma 11:26
Absolutely. There were a bunch of them. Let me talk through a few. One of them, I think, was just a great example of how you’d draw it up, how you’d imagine all this playing out. This data scientist looked at the data and saw there were all these Airbnb hosts who basically never accepted anyone. This was back in the day before Instant Book was a really large percentage of traffic. Instead, you had to send a request to every host: am I allowed to stay on this date? I’m bringing my dog with me, I’m showing up late, or whatever. Is that okay? And there were some hosts who just literally never said yes. And so this person noticed that these hosts were essentially dragging down the marketplace, because there were all these people who would spend all this effort vetting an Airbnb listing and messaging the host, only to be declined. And so from there we made an experiment that took the host denial rate, steadily demoted hosts, and eventually took them off the system if they were too strong of naysayers. And that was a huge success. It really moved metrics. And that was one of the earliest examples; this was back in 2013 or something, before a lot of other experiments had come out. So that was one of those moments of, oh, I think we have something here, because this sort of strategic analysis can be hard to push through every level of the hierarchy, but you can run experiments, show it’s a powerful lever, and get reinvestment. So there are examples like that. But then every company has these examples, these little changes that seem so cosmetic there’s no way they could matter that much, and then they just blow the metrics out. In Airbnb’s case, this engineer ran this experiment where when you click on a listing, it opens in a new tab. That was the entire change: when you click on a listing, it opens in a new tab. And I think it was the largest metric improvement of any experiment over five years. Turns out it’s very obvious: you do that, you don’t lose your search context.

Eric Dodds 13:37
Sure, ’cause you want to keep clicking on listings. Of course, we all do that. It’s like, I could have told you that.

Che Sharma 13:44
But man, I think it boosted bookings by like two or three percent. It was just one little change. And it’s exactly the sort of thing that happens in experimentation: there’s no way people would have noticed that that was a big deal. The design team probably would have had hesitations about it. Every company has these stories, and I always think it’s fascinating, because it’s not just random chance. There are important lessons here. Like, yeah, not losing your search context matters a lot in a system like Airbnb’s.

Eric Dodds 14:15
Yeah, totally. One quick question— two more questions. I’m monopolizing here and I know Kostas has a bunch of questions. First one, it’s interesting to think about the economics, right? So when you talked about hosts who never replied, that’s almost like calling a store and saying, do you have this? And then you go to the store and it’s not there, right. And so over time, it’s like artificial supply, which is an economic problem. Did you have to apply a lot of economic thinking in the data science practice at Airbnb?

Che Sharma 14:47
100%. The person who did that analysis wasn’t a Ph.D. economist, but I actually think there are a lot of skills that go into data science. There’s a straight engineering piece of just, how do you make reliable, robust data systems? But when you talk about the ultimate goal of data science, and it’s something I always try to confirm and validate for people, the whole reason you start a data team is not to have a data warehouse or the modern data stack or whatever. The whole point is to make better decisions. So you need to understand: what data, what analyses, what can I do that’s going to lead to better decisions? And economists have a lot of background in that sort of thing. So it makes a lot of sense.

Eric Dodds 15:34
Yeah, absolutely. Okay, this is going to be kind of a detailed question, but hopefully it sets the stage for Kostas to ask a bunch of technical questions as well. One thing I’m interested in, and I’m thinking about our listeners who maybe are dealing with a vendor A/B testing tool, or maybe have built something themselves, or are even just trying to think about how to approach this. You said someone built a simple feature flagging mechanism at Airbnb. So one of my questions is, and this is sort of a problem that every company faces, or at least in my purview: you have feature flagging in the context of testing and data science, but then you have this problem where you kind of want that feature flag to be available in multiple places. Generally, you’re also running a separate product analytics infrastructure and set of pipelines, you have growth, you have customer success, all those components. How did you deal with that from a technical standpoint? Because you hear about building your own feature flagging thing and it’s like, does that actually make it harder to deal with all these other pipelines as well?

Che Sharma 16:46
Yeah, it’s a great question. It touches on what I would call the modern experimentation stack. To run experiments, you basically have these pieces, right? You have one, which is feature flagging and randomization. That’s the start of the technical stack: people arrive, they’ve got to be put into groups. And that’s something that involves making a bunch of clients, you have a Java service, you’ve got an iOS thing, everything. You have to get it across all these places, and you need some central way to manage them. But what I think is funny is that pretty much all the commercial world thinks that’s actually it, that’s where it ends. And so you’ll see tools like Optimizely and LaunchDarkly, which pretend that data warehouses don’t exist; they just write a feature flag, and then okay, let the data scientists sort it out. And that’s the gap we’re trying to fill. So our product today actually does not have feature flags at all, although we’ll probably be building that phase in. Instead, what we rely on is this basic separation of where feature flagging meets analytics, which is the data warehouse. All of these feature flagging tools, even if they don’t directly give you data, it’s very easy to build your own instrumentation and get that data into your Snowflake, or BigQuery, or Redshift, or whatever. And as tools like RudderStack show, there’s this amazing new ability to get everything into the warehouse, so it’s a nice central point to operate off of for applications like ours. So in the modern experimentation stack, you’ve got feature flagging; you have metrics, which inevitably live in a data warehouse; a bunch of pipelines to intermingle these things, calculate your quantities, run diagnostics, do your investigations; and then reporting, which is this very public, cross-functionally consumed interface answering, how are my experiments doing?
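
To make the feature flagging and randomization piece concrete, here is a minimal sketch of deterministic, hash-based assignment, a common way for every client (a Java service, an iOS app) to agree on a user’s group without central coordination. The function name, key format, and 50/50 split are illustrative assumptions, not Eppo’s or any vendor’s actual implementation.

```python
import hashlib

def assign_variant(user_id: str, experiment_key: str,
                   variants=("control", "treatment"), weights=(0.5, 0.5)) -> str:
    """Deterministically bucket a user into a variant.

    Hashing experiment_key + user_id yields a stable pseudo-random value in [0, 1),
    so the same user always lands in the same group across services and sessions.
    """
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:15], 16) / 16**15  # roughly uniform in [0, 1)

    cumulative = 0.0
    for variant, weight in zip(variants, weights):
        cumulative += weight
        if bucket < cumulative:
            return variant
    return variants[-1]

# The assignment is stable per user; in practice each exposure would also be
# logged to the warehouse so analysis can join assignments to metrics later.
print(assign_variant("user_42", "open_listing_in_new_tab"))
```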

Eric Dodds 18:41
Totally. Alright, I’m going to be so rude to Kostas and actually just continue to ask you questions, because I can’t stop. I can’t help myself. This is so interesting to me. I’m coming at this, just so you’re aware, as someone who’s worked in marketing and done a lot of data stuff and used a lot of A/B testing tools. It seems like the packaged solutions for A/B testing get a lot of their value from basically handling the statistical analysis as a service. You run your test, it says this is variation one, and then it tells you whether this is going to improve your conversion rate or not. But the challenge has been that they keep all the data trapped inside of their particular system, which I think inherently limits the value of that data, because ultimately you want to see that data in the context of all the other data you have. Is that kind of the idea behind Eppo? That you’re not creating obfuscation around the data itself, and instead just providing the, yes, this is statistically significant or not?

Che Sharma 20:01
There are a few pieces that go into what makes Eppo Eppo. Specifically with regards to the data, where it lives, what it comes from, our standpoint, which is what you’ll see as a principle at Airbnb, or Netflix, or Google, whatever, is that there should be a single definition of, what is revenue? The data teams are singularly focused on defining that thing: what is revenue? What is a purchase? What is a subscription upgrade, or whatever? And the natural home of those things is your data warehouse. There are two points of failure with most of the existing systems. One is that they create their own parallel data warehouse, so suddenly they’ve got their own idea of what is right, and it’s hard to really sync it up with your own. In addition to revenue itself, you want to split revenue by a bunch of other things, right, by marketing channel or by persona, whatever. So that’s one thing: having an incomplete and parallel version of the data drives data teams insane. It’s like, I spent so much time trying to define this thing, and here’s this system telling a PM that they increased revenue, when that revenue doesn’t even include this other source that’s inaccessible to the system. So that’s one piece of it. And that problem gets exacerbated by different business models. If you have multiple points of sale, trying to instrument each one separately doesn’t really make sense; it’s centralized in the data warehouse. If you use Stripe, Stripe is not a set of data that’s accessible to those systems. There are a bunch of things like that. So yeah, that’s definitely a core piece of it. But the other piece is almost a design principle around how experiments should organizationally be run, because one of the things I run into all the time is that a lot of companies will match their organization to the tools instead of matching their tools to the organization. And it’s really unfortunate, because it puts a lot of pressure on hiring high-expertise statistician types and economists to actually be able to run experiments the way you’re supposed to, to follow statistical protocols and have good procedures, whereas a tool should just enable those sorts of things. So the way we operate is that a lot of companies might have one, two, three of what you’d call experimentation specialists, who have opinions around what metrics matter, how they’re defined, which statistical regime we’re going to use, and we want Eppo to let them scale out that knowledge, where they can do a one-time definition, say here are the rules of engagement, and then going forward, some junior PM fresh out of college who’s never done this before can just operate within the system, turn the crank, and know that just by using the system, they’re following all the guidelines that were set up.

Kostas Pardalis 22:50
I’m probably going to disappoint Eric a little bit, because he’s expecting me to ask something very technical, probably, but I want to start with something very basic. I want to ask you, what is an experiment? Because we keep talking about experimentation platforms and all these things, but let’s start from the basics. What is an experiment? What defines one?

Che Sharma 23:15
I love it, because it’s the basic questions that are the most technical, actually. I’m going to give you my simple answer, and then I’m going to give you my galaxy-brain answer. The simple answer is that an experiment is a methodology that you probably learned about in grade school, where if you have a theory of what is driving change, you take a group of people, or a group of something, you flip coins a bunch of times to split them into multiple groups, and then you measure which group did better. Say I think that making people do a morning walk every day will lead to lower diabetes, whatever: you have one group you tell to walk every day, and then you measure how much diabetes you get. So that’s the basic methodology, irrespective of what type of A/B experiment it is: you need some random way of dividing people into groups, you need some way to measure success, which is your metrics, and then you can try out different ideas for what drives success.

Now, my galaxy-brain approach to this is that an experiment is anything that has a kind of comparison group, something that says, did this group do better than that group? And what’s interesting in the world today is that if you look at, say, Netflix, there are a bunch of products you ship that don’t lend themselves well to A/B experiments where you can divide the world up. Think of a pricing experiment: are you going to give half the people one price and half the people another price for a very well-known product? Or if you’re Netflix and you ship Stranger Things, the most important question Netflix is asking is, did you get an ROI on Stranger Things? So there’s a rich suite of causal inference methods that try to figure out whether metrics moved once you control for a bunch of factors. And the galaxy-brain answer is that, well, that’s also kind of decision science.
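
As a minimal illustration of the “measure which group did better” step, here is a sketch of a two-sample comparison on a per-user metric using Welch’s t-test. The metric, group sizes, and simulated data are made up for illustration; a real analysis would join assignment logs to metric tables in the warehouse.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical per-user metric (e.g., bookings per user) for each group.
control = rng.poisson(lam=0.30, size=10_000)
treatment = rng.poisson(lam=0.32, size=10_000)

lift = treatment.mean() - control.mean()
# Welch's t-test does not assume equal variances between the two groups.
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

# Rough 95% confidence interval on the absolute lift.
se = np.sqrt(control.var(ddof=1) / len(control) + treatment.var(ddof=1) / len(treatment))
ci_low, ci_high = lift - 1.96 * se, lift + 1.96 * se

print(f"lift={lift:.4f}, p={p_value:.3f}, 95% CI=({ci_low:.4f}, {ci_high:.4f})")
```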

Kostas Pardalis 25:25
That’s super interesting. For as long as I can remember hearing about the theory of experiments, we’re always talking about A/B testing, which, as you described right now, is about splitting your population in two and running the experiment there. Is that the only way we can do experiments?

Che Sharma 25:50
That gets back to the simple version of the word experiment versus the galaxy-brain version. To do a basic A/B test, you do need some random, or at least uncorrelated-with-the-metric, way of dividing people into groups. The nice thing about these online platforms is that dividing people into groups randomly is actually a well-solved problem; it’s probably the easiest part of the workflow. So if you have the ability to randomly split people into groups, it’s kind of the best way to do it. Now, there’s a depth to this topic, like what happens when there are some users in the group who have a way-outlier, disproportionate effect? You can try to randomize them, but they’re just going to overpower everything. How do you deal with that? There’s a set of methods, stratified sampling, variance reduction techniques, and a bunch of ways to do it. But there are ways in which the random sampling thing can break, and again, it falls on tools like Eppo to try to make you aware of them.
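
Here is a hedged sketch of the stratified sampling idea mentioned above: instead of flipping one coin per user, you randomize within strata so that heavy outlier users end up spread evenly across groups. The strata, field names, and tiering are hypothetical and purely for illustration.

```python
import random
from collections import defaultdict

def stratified_assign(users, strata_key, variants=("control", "treatment"), seed=0):
    """Randomize users to variants within each stratum.

    Shuffling and alternating inside each stratum keeps the groups balanced on
    the stratifying attribute (e.g., a spend tier dominated by a few whales).
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for user in users:
        by_stratum[user[strata_key]].append(user)

    assignment = {}
    for members in by_stratum.values():
        rng.shuffle(members)
        for i, user in enumerate(members):
            assignment[user["id"]] = variants[i % len(variants)]
    return assignment

# Hypothetical users with a pre-experiment spend tier.
users = [
    {"id": "u1", "tier": "whale"}, {"id": "u2", "tier": "whale"},
    {"id": "u3", "tier": "regular"}, {"id": "u4", "tier": "regular"},
    {"id": "u5", "tier": "regular"}, {"id": "u6", "tier": "regular"},
]
print(stratified_assign(users, "tier"))
```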

Kostas Pardalis 26:56
You mentioned that this is mainly the easiest part of the workflow, so what is this workflow? Take us on a journey of working with the product itself. How do we start, and what do we have to do until we get a result?

Che Sharma 27:17
Yeah, absolutely. So let me walk you through what I would call the experiment lifecycle, and from there I’m happy to dive into any part of it. The start of an experiment is to have a basic alignment on, what are we trying to do here? Are we trying to increase purchases? Are we trying to reduce customer service tickets? What is our overall goal? Just have some idea of, this is what our goal is. And the corollary to that is, you need a metric for it. What is a customer service ticket? What is a purchase? From there, the second stage is you need to come up with hypotheses for how you’re going to drive that metric. Is it that we want to reduce complexity and reduce friction? Do we want to increase urgency, increase desire? Are there social proof things? You come up with a big list of stuff, saying, here are all my ideas for how I think we can improve things. From there, there’s a basic product prioritization: what is the expected impact for each one, what is the expected complexity, etc. So you come up with hypotheses, and you have some way of deciding which ones you want to do. From there, you have to design an experiment. Designing an experiment is both the product side, the UI and UX type of thing, but there’s also a statistical piece, which is called a power analysis. Basically, to actually get signal out of this change, how many people do you need, and how long do you need to wait? You need enough sample size to be able to get a signal. So that’s what I would call experiment design. From there, you have to implement it. Implementing the experiment is where you touch on the feature flagging side. It’s also where the product gets built: hopefully you implemented it without a bug, hopefully you didn’t break it on iOS or something, or push some important design assets below the fold. So there are implementation details there. From there, the experiment runs for a period of time, and you want to observe it and make sure it’s healthy. Now, this is one of those tricky things where experimentation has this central issue where, from a statistical standpoint, you shouldn’t peek too much. You shouldn’t stop the experiment early. You shouldn’t really examine it too closely until it’s done. But that’s just not a reality for most organizations. Real political capital is being spent on this thing. You can’t afford to let an unhealthy or unproductive experiment take up weeks of the product timeline. So figuring out how to navigate that is also something that we at Eppo have some opinions on. From there, you reach the end of the experiment and you have to make a decision. A lot of times that decision might be simple: did the metric go up or down? But sometimes it gets complicated. What happens if one metric goes up and one metric goes down? What happens if revenue goes up and customer support tickets go up? What do you do there? So navigating that decision and that metric hierarchy is a big thing here. And then from there, you have to make a decision, actually go forward and say, I’m going to launch it, or I’m not going to launch it, and record that for posterity. So there’s a bunch of stuff involved here, and I always think it’s very telling that the commercial tools touch on such a limited part of it.

Kostas Pardalis 30:41
That’s a modern experiment lifecycle, and it sounds quite complicated, to be honest. There are quite a few steps and many different stages where something can go wrong. So how much of this lifecycle can you cover right now with the product that you have?

Che Sharma 31:01
Yeah, right now we do a lot of the post-implementation details. The things we do are, first, we let you build up this large corpus of metrics. That includes your important stuff like revenue, purchases, etc. It also includes your more bespoke stuff like widget clicks, or whatever. And so all of that becomes addressable in one system. If you want to see, for revenue, how did all my experiments do historically? What were the biggest revenue drivers, the biggest negative drivers, whatever, you have those views. From there, we also help you after implementation to say how long the experiment should be running, this power analysis thing. We automate that and make it self-serve for you. And we automate all these diagnostics, so that if there’s some issue with randomization, if it’s not actually random, we will alert you and make you aware of it. If there’s a precipitous metric drop, we’ll make you aware of it. And then, at the end, for guiding your decision, we have a sort of opinionated, designed interface that is meant to say, again, if you have some junior PM fresh out of college and they need to make a decision here, how can you lead them in the right direction? It allows you to incorporate opinions on what metrics matter and which metrics are sort of exploratory. From there, all the experiments get compiled into this knowledge base. Here’s where you can look through all the past experiments, see whether anyone has tried this before, and understand the picture of what the big drivers have been historically. So that’s where our system touches today. I think as time goes on, we’re going to be reaching further and further back into the planning cycle. Today we do power analysis once an experiment has started; pretty soon we’re going to do power analysis before it starts, so you can actually do a kind of scoring of the complexity of these experiments going in, and then group experiments according to themes, like here’s a bunch of experiments that are all related to driving search ranking personalization goals, or something.
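
One of the randomization diagnostics described here is commonly called a sample ratio mismatch (SRM) check: if you intended a 50/50 split but the observed group counts are far off, something upstream is likely broken. Below is a minimal sketch using a chi-square goodness-of-fit test; the alpha threshold and counts are illustrative assumptions, not Eppo’s actual implementation.

```python
from scipy import stats

def sample_ratio_mismatch(observed_counts, expected_ratios, alpha=0.001):
    """Flag a sample ratio mismatch with a chi-square goodness-of-fit test.

    A tiny p-value means the observed group sizes are inconsistent with the
    intended split, which usually points to an assignment or logging bug
    rather than a real treatment effect.
    """
    total = sum(observed_counts)
    expected = [total * r for r in expected_ratios]
    chi2, p_value = stats.chisquare(observed_counts, f_exp=expected)
    return p_value < alpha, p_value

# Intended 50/50 split, but one group received noticeably fewer users.
flagged, p = sample_ratio_mismatch([50_600, 49_100], [0.5, 0.5])
print(f"SRM detected: {flagged} (p={p:.2e})")
```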

Kostas Pardalis 33:15
You’ve mentioned metrics a lot in our conversation so far, and it seems like having a good understanding and a good shared definition among all the stakeholders of what a metric is, is quite important for interpreting the results of whatever experiment you’re doing. And I can’t stop thinking of all the different places we keep metrics in an organization. We keep talking a lot about data silos, but there are also metric silos, and we don’t really talk about that. It’s really easy to recreate the same metric in many different places with even slightly different semantics, and that might make a big difference. So how do you deal with this problem? And how do you relate to this whole movement we’ve been hearing a lot about lately, metric layers and metric repositories? How does your product work with that?

Che Sharma 34:20
Yeah, absolutely. So I have been very heartened to see all these metric layers and metric repositories take off, because I’m obviously a big believer in them. Experimentation systems get a lot more powerful when companies have a clear definition of metrics. So I see our system integrating well with those metric layers; we are one of many downstream processes that should operate off a single definition source. There’s a little bit of, let’s see which ones catch on and figure out what the right integrations are, but it’s definitely in our interest to play well with them. In terms of how we deal with it, I think there are two things. One is that experimentation as a practice gets more powerful with the diversity of your metrics. To give an example of that, I’ll tell you a story from Airbnb. There were these two teams. One team was focused on driving this Instant Book feature. As I mentioned before, it’s very annoying that the host has to approve you, so with Instant Book a host just says, look, I’m just going to accept everyone. And that became a strategic thing we were trying to improve. It started out at, what, like two or three percent of hosts having it on, and today it’s like 80 or 85 percent, a really, really huge change. So there was one team just running a bunch of experiments, a whole stack of them simultaneously. There was another team that was trying to make it so that when you use Airbnb, you sign up much earlier than you currently do. Airbnb is an app where you can get all the way to the checkout page and never create a user account, right? So they were experimenting with various ways of incentivizing people to create a user account earlier. These were teams in different parts of the building that might have hung out socially, but they weren’t really sharing roadmaps and stuff like that too much. It turns out that experiments that drive signup rates actually had a crazy effect on driving up Instant Book rates, because this Instant Book feature was gated behind signup. So it’s exactly an example of where these metrics have surface area and interactions between them. If you have some ability to say, I am the business travel team, all I care about is business travel, I just want to see how every experiment affects it, these sorts of views become super important. So from our standpoint, we’re happy that there’s been a standardization of metrics. Philosophically, we are 100% in the direction of saying companies should have a single source of truth around them, and we aim to build off those systems. Experimentation is very metrics-hungry.
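
As a hedged illustration of the single-definition-source idea, here is a sketch of a tiny metric registry in which revenue is defined once as SQL against the warehouse and every downstream experiment analysis references that same entry. The table, column, and function names are hypothetical, not Eppo’s or any metric layer’s actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricDefinition:
    name: str
    entity: str          # unit of analysis, e.g. a user id
    sql: str             # the single definition, owned by the data team
    description: str

# Defined once; marketing dashboards, product analytics, and experiment
# analyses all read this same entry instead of re-deriving revenue.
METRICS = {
    "revenue": MetricDefinition(
        name="revenue",
        entity="user_id",
        sql="""
            SELECT user_id, SUM(amount_usd) AS value
            FROM warehouse.fct_purchases   -- hypothetical fact table
            GROUP BY user_id
        """,
        description="Total purchase revenue per user, all points of sale included.",
    ),
}

def experiment_readout_sql(metric_key: str, assignment_table: str) -> str:
    """Join the shared metric definition to an experiment's assignment table."""
    metric = METRICS[metric_key]
    return f"""
        WITH metric AS ({metric.sql})
        SELECT a.variant, AVG(COALESCE(m.value, 0)) AS avg_{metric.name}
        FROM {assignment_table} AS a
        LEFT JOIN metric AS m USING ({metric.entity})
        GROUP BY a.variant
    """

print(experiment_readout_sql("revenue", "warehouse.exp_instant_book_assignments"))
```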

Kostas Pardalis 37:00
Makes a lot of sense. Experimentation platforms have been around for quite a while; they’re not necessarily something new. Why do we need to innovate today and bring something new to the market? What’s the reason behind that?

Che Sharma 37:20
There are a few answers here. One is that commercial experimentation tools have existed in this feature-flag-centric world, which is different from what you might have seen at a place like Microsoft back in the day. And I think the scope of what experimentation includes has widened to include core business metrics. It’s no longer a CRO consultant trying to drive signups off your marketing pages; now you’re trying to drive core OKRs at the company through an experimentation strategy. Part of the reason that has been enabled is the rise of cloud infrastructure, where suddenly a lot more people are working off this common set of tools that makes it very easy to do these complex workflows. I think of Optimizely as a company that might even have been a little bit ahead of its time in 2008 when they started. We could not have built Eppo in 2008, because instead of integrating with Snowflake, Redshift, and BigQuery, I would have had to integrate with Teradata and Hive clusters and whatever else, all sorts of on-prem systems. It would have been a much tougher argument to say, how do we integrate with these databases? The entire analytics side was just really hard before, in a way that has now become possible to deal with. Also, most companies now are operating on AWS, or GCP, or something like that. So having cloud infra in place where you can quickly turn experiments on and off, ramp them up and down, and have this very iterative process, this continuous integration environment, is now just much more common than it used to be.

Kostas Pardalis 39:13
Makes sense. The first thing that comes to mind, at least for someone like me who has worked with experimentation platforms before, is product. It’s a tool that’s heavily used by product teams. Where else can we use experimentation? Or do you think it’s a tool that’s reserved only for product managers?

Che Sharma 39:36
Experimentation today probably has two big homes. One is product development, and I think that’s a much more expansive definition than just growth and ML teams; anyone literally making changes to code can easily do it. The other big place I’ve seen a lot of experimentation is marketing. If you think of ad campaigns and some of the management of growth marketing, that’s another place that’s very experiment-heavy. So those are my two biggest buckets. Where else does it grow from there? Experimenting on operational teams is something you’re starting to see more companies deploy, such as a sales team or a customer service team. Like UPS, they can experiment on their fleet of drivers or whatever. So that’s kind of an emerging area. From my standpoint, the product side of things and the growth marketing side of things are just these hugely growing industries. Now that we have product-led growth, we have more bottoms-up, self-serve motions, etc., that make it just really attractive.

Kostas Pardalis 40:46
As a person who has worked in product, I always thought of, and by the way, my background is mainly B2B products, so the first time I tried to use an experimentation platform in a B2B product, I failed miserably, to be honest. And I developed this way of thinking that experimentation platforms are mainly for B2C companies, because you need to have the volume that will drive these experiments. Is this true?

Che Sharma 41:20
The thing with sample size in experimentation is that it’s really about what sort of effect size you’re trying to detect. So you can run experiments when you don’t have too much sample size; there’s still value in just preventing disasters, right? There’s sort of a, look, I just want to make sure I know if a metric dropped like 10%, if there’s some very major issue. So when you look at B2B enterprise companies, you see experimentation play out much more as hygiene: we just need to make sure that everything is healthy, or that if something has outsized success, we know about it. But I would say that the business models that are much more levered on experimentation, comprehensively beyond that, are these consumer, prosumer ones: you arrive at a website, you sign up and purchase. Today, every startup has to have a kind of focus to start with, and we like to focus a lot on these consumer and prosumer companies.

Kostas Pardalis 42:17
Yeah, yeah. Makes total sense. Let’s say I’m a founder and I’ve started building my company and my product, and I’m looking for product-market fit. Where should I introduce an experimentation platform? Is this something that I should be using before product-market fit, or after? Is it something that can help me find product-market fit faster? You’re also a founder, so what’s your opinion on that?

Che Sharma 42:48
My opinion is that basically, once you have the sample size, the ROI on experimentation is so clear that you should really be doing it. It’s basically asking, do you want to measure whether you should ship this or that? That’s the basic question, and the answer is clearly yes. Now, it happens to be the case that having a sample size that lets you run experiments really well probably means you have some amount of product-market fit. But I think the main thing is just sample size. Can you actually run an experiment at all? Because once you can, it’s just really clear.

Kostas Pardalis 43:20
Can I ask you to give us some kind of sense around what this sample size should be?

Che Sharma 43:29
What you should think about is, what is the most common behavior that you care a lot about? Maybe you don’t care too much about signups, but maybe you do. If you’re, say, Webflow, maybe you care about people publishing sites, or something like that. It’s not exactly a subscription; it’s not your North Star revenue-based metric. But it is something where, through all of Webflow’s history, they have noticed that driving publishes is a powerful thing. So what you might want to do is say, okay, I have this many users, I have this many signups every day, and of those people, here’s how many are publishing. From there, I can plug it into the online calculators for power analysis (we’re building our own), and there are different ways to conceive of it. And then from there, it’s just asking, what is your comfort level with running an experiment for three months, or two months, or one month, or two weeks, whatever? Once you have an answer to that that feels comfortable, meaning you’re not going to lose a lot of product speed by waiting and being blocked on this experiment for that amount of time, you should do it.
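
To make that power analysis step concrete, here is a sketch of the standard two-proportion sample-size calculation that the online calculators mentioned above implement: given a baseline rate (say, the share of daily signups who publish a site), a minimum detectable lift, and the usual alpha and power defaults, it estimates users needed per group and a rough run time at a given traffic level. All of the numbers are made up for illustration.

```python
from scipy.stats import norm

def required_sample_per_group(baseline_rate, min_detectable_effect, alpha=0.05, power=0.8):
    """Approximate users needed per group to detect an absolute lift in a proportion."""
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_effect
    p_bar = (p1 + p2) / 2
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance threshold
    z_power = norm.ppf(power)           # desired statistical power
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5 +
                 z_power * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return numerator / (p2 - p1) ** 2

# Hypothetical numbers: 20% of new signups publish a site, and we want to
# detect a 1 percentage point absolute lift with a 50/50 split.
n_per_group = required_sample_per_group(0.20, 0.01)
daily_signups = 2_000
days_to_run = 2 * n_per_group / daily_signups

print(f"~{n_per_group:,.0f} users per group, roughly {days_to_run:.0f} days "
      f"at {daily_signups} signups per day")
```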

Kostas Pardalis 44:39
Yeah, one last question and then I’ll let Eric ask any questions he has. I think it’s clear that having a person in the company who knows about statistics is very useful when you operate these tools. But that’s not always the case, especially when we’re talking about younger companies. I can’t imagine many companies that have a data scientist or an analyst when they start, unless the product is really related to data science. What would you recommend to a founder or a product manager who doesn’t have access to any statistical resources? How should they educate themselves in order to maximize the value they can get from these tools? For example, you’re talking about power analysis. How can I learn about power analysis? How can I do this thing?

Che Sharma 45:31
Yeah, absolutely. It’s funny you mention that, because one of the things we’re going to be working on is a modern experimentation guide. One of the things that’s sort of tough is that there’s a lot of content on the web, but it’s all fragmented everywhere, right? So it’d be nice to compile that. I think there are, let’s say, two or three resources I would recommend if you’re a product manager trying to get into experimentation. The first is the Reforge program, which I think is great. It’s basically an MBA for product managers, which is I think how they describe themselves, and it naturally leans a lot more on quantitative methods and experimentation than many other places. So I think that’s a great use of your learning budget. Another great use of your corporate learning budget, if you haven’t already, is Lenny’s newsletter. Lenny is an Airbnb alum of ours and also an investor, and the Slack group related to the newsletter is really informative. I’ve personally learned a lot from it, and there’s a lot of experimentation talk; I like to contribute there. So that’d be another great resource. And then the third, probably the closest thing you can call an experimentation bible, is Ronny Kohavi’s book on online controlled experiments. Ronny Kohavi, for those who don’t know, was part of a pioneering team of experiment scientists at Microsoft. Microsoft, especially back in the day, really pushed the frontier on what was possible with these online platforms, and Ronny Kohavi is probably the one who did the most evangelizing of that in his book and his talks and things like that. So that’s a great resource to read through, very readable, and I believe he now even has an online course, so that might also be a great venue. And of course, for any of your listeners, I love chatting with people, so if you want to just email me or DM me on Twitter, I’d love to chat about whatever experimentation topics you have on your mind.

Eric Dodds 47:34
Cool, super helpful. Okay, one more question Che, and hopefully this doesn’t push us over time. But we’ve learned so much about sort of testing frameworks and methodologies. You’ve seen both sides, the infrastructure side and the testing side. And so I’m thinking about our listeners who are maybe a data engineer who isn’t as close to the data science team, or machine learning side. Could you just help us understand how should data engineers who aren’t as close to that work with data scientists? Like, what are your thoughts there? What do you have that could be helpful for those listeners?

Che Sharma 48:11
Yeah, absolutely. So I think the starting place is to just say that the reason I started an experimentation company is that experimentation unlocks so many of the things that a data team wants to do. Data teams, again, exist to drive good decision-making, with a value system that tends to be more metrics-based and data-based than other teams’. And experimentation is just an incredible cultural export of those values to the organization. It really helps people engage with data in a much more meaningful way, one that doesn’t require nearly as much cognitive load to build into your decision-making process. So I think a starting point is just to say, if you’re a data engineer, your work will be a lot more impactful in the organization if you have an experimentation program. So that’s the starting point. In terms of how you enable it, if you are a data engineer, a lot of it is very similar to what you might see in most other data engineering practices. The things you can do are: provide great definitions of metrics for things that drive the business, candidates for things where, if this goes up, we’re all in great shape as a business, and then provide great APIs to them. Is it very easy for data scientists and PMs to utilize those metrics, easy to plug them into tools like Eppo? Do you have great ways for people to build an affinity with them and understand what drives them? I think those same topics that apply to pretty much everything else a data engineer does apply heavily to experimentation.

Eric Dodds 49:51
Okay, super helpful. I also think that the accessibility piece for data engineers is really helpful, because a lot of times it’s hard just to collect the data and get everything in one place. Do you have a story that sticks out to you about a data team that was maybe sort of behind the scenes, and then experimentation or something happened where all of a sudden it was like, wow, we’re best friends now?

Che Sharma 50:21
Oh, yeah, absolutely. Experimentation tends to really create new relationships in the org, in the sense that the work just becomes a lot more visible. At Webflow, we spent a lot of time trying to quantify when someone has onboarded, when they’ve activated in the system. And if you’ve used the product, it’s quite complex, right? It looks a lot like Photoshop, it’s got buttons everywhere, it’s brimming with power, but you might have to take a Coursera class or something to learn how to use it. So the learning curve is this big issue, and we had significant data resources going toward trying to find levers to improve it. Now, the thing is that you can have a lot of different theories around how to improve activation, but it really helps if you can just show that your theories are correct because you built an experiment against them and you drove the metric. So there was this growth team that spent a lot of time trying to craft activation metrics and levers. Once they were running experiments, the whole product org was aware of the experiments being run and how they were trending, and we started seeing more product teams want to run experiments themselves as a result, which is great. I always say, one of the blessings and curses of experimentation is that it’s a very public process. It draws a lot of eyeballs. It has very much of that man-in-the-arena quality, where this team is going to go and expose whether they were successful or not, in a way that most product teams don’t expose themselves. So it’s great; the right type of product team lives off that stuff. And then if you’re a data engineering practice, you’ve got to feed the data into that process. Now your tables have to be pristine, they have to arrive on time; meeting an SLA on the data engineering pipeline becomes much more critical for an experiment, and I think that naturally leads to more resources in the area.

Eric Dodds 52:20
Absolutely. Okay, one last thing. If someone’s interested in looking at Eppo, where should they go?

Che Sharma 52:27
Just go to geteppo.com. That has details about the product and also has a link to reach out. Also, like I said, I love chatting with people, whether on Twitter, Slack, LinkedIn, or whatever. So you can reach out to me, Che Sharma, on any of those mediums; I’d love to get in touch. I love talking with anyone who’s interested in experimentation, no matter what their maturity stage or readiness for a product like ours is. So we’d love to chat with whoever.

Eric Dodds 52:54
Awesome. Well, Che, thank you so much for the time we learned so much, and just really appreciate you taking the time to chat with us.

Che Sharma 53:00
Absolutely. It’s been a pleasure.

Eric Dodds 53:03
I’m just constantly struck, I think every single show, I just am amazed by how smart the people that we get to talk to are and what they’ve done. And Che of course is no different studying synchronization between brain hemispheres and then building a statistics practice inside of Airbnb. Pretty amazing.

Here’s my takeaway, and this isn’t super intellectual, but it’s enjoyable, hopefully. I really appreciate that even though it’s clear that Che is bringing a huge amount of knowledge and experience into building this technology that does some pretty complex things, especially on the statistics side, he acknowledged that, you know what? It’s the small, dumb things that can make the biggest difference in this world, like opening a link in a new tab. And it was funny just hearing him talk about that and seeing him smile, because that’s as simple as it gets. But it was the winningest experiment over a five-year period at Airbnb. So I just really appreciate that. We can throw as much math and technology at this as we want, and sometimes it’s just a really simple idea that wins the day.

Kostas Pardalis 54:21
Yeah, 100%. I think one of the misconceptions around experimentation is that the experimentation process is going to tell you what to do, which is not the case. You have to come up with a hypothesis, you have to come up with what matters and why, and the experimentation platform is there to support you in the decision that you’re going to make. That’s what we have to keep in mind, and that’s how I think we should be using these tools: as another tool that can help us make the right decision in the end. One of the things he said that I think is super important is that these platforms and these methodologies also provide the credibility that is needed to communicate more effectively whatever decisions you propose to the rest of the stakeholders. So that’s what I keep from the conversation today. I think it was a way to give a very realistic description of what an experimentation platform is, what we can achieve with it, and what to expect from it.

Eric Dodds 55:34
I agree. Well, thank you for joining us and learning about A/B testing today. Lots of great shows coming up, and we’ll catch you on the next one.

We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.