On this week’s episode of The Data Stack Show, Eric and Kostas talk with Ryan Boyer, principal data scientist at Shipt. Ryan, who’s been with Shipt since before its acquisition by Target, offers insights on the evolution of data science at the company and more in this conversation.
Highlights from this week’s episode include:
The Data Stack Show is a weekly podcast powered by RudderStack. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Eric Dodds 00:06
The Data Stack Show is brought to you by RudderStack, the complete customer data pipeline solution. Thanks for joining the show today.
Eric Dodds 00:15
We have a guest on the show today who I’m particularly excited to talk to because I’m a customer of theirs. And it’s the company Shipt, they do grocery delivery, and now all sorts of other stuff. And actually, we’ve been customers in our household for a long time. So I remember when they got acquired by Target. And one thing that I’m really interested to ask Ryan about, who’s a data scientist at Shipt, is just the complexity. So if you open up the Shipt app and use it, there’s so much going on there, even just from the consumer side. And I can’t imagine the challenge of dealing with all the different data sets that they have in terms of building models and sort of just managing the entire data science practice. So complexity is my burning question. Kostas?
Kostas Pardalis 01:02
Data. I’m pretty sure that they have a lot of data that they’re working with. And I want to see both how they grew from the first day until today in terms of like, the data itself and the infrastructure behind it. And so what are the challenges around that? And keep in mind that we are talking about a marketplace at the end, which always complicates things, although we tend to see only one side, the side that we are part of on the marketplace. So I’m pretty sure that he will have very interesting information to share about how data is important to growing marketplaces.
Eric Dodds 01:35
Absolutely. Well, let’s jump in and talk with Ryan Boyer from the data science team at Shipt. Alright, Ryan Boyer, welcome to The Data Stack Show.
Ryan Boyer 01:44
Thank you so much for having me. I’m really excited to be here.
Eric Dodds 01:48
Oh, man, we have so many things, so many things to ask you about. But why don’t you just give us a brief background? So you know, where did your career start out? And then what was the pathway that led you to data science at Shipt?
Ryan Boyer 02:00
Yeah, this is a great story. So I got a math degree at Clemson University for undergrad, and then, as opposed to going to grad school like I initially planned, up and moved to Bozeman, Montana, where I became a ski bum, a very terrible ski bum, and stocked store shelves at Target for about a year. Six years later, here I am working at a company owned by Target solving problems about out of stock products at a national scale. So very much I’ve come full circle. How I got into data science and how I landed at Shipt is a little more direct. After learning that I was a bad ski bum and wanted to use my brain a lot more, I went back to grad school, got a degree in system and information engineering, focusing a lot on data science, math, statistics, and then ended up in Birmingham, Alabama, because it was where my wife grew up and wanted to do data science in a small southern town. And there was really one or two options. And I got lucky and joined what was, at that point in time, a decent size startup named Shipt. And it has been rocketship growth ever since then. I was the third person with the title data scientist. And now I’m on a team with I think 50 people in the data science organization. And we’re always hiring, as far as I can tell. So it’s been a lot of fun.
Eric Dodds 03:19
Yeah, that’s great. We encourage our guests to tell our audience when they’re hiring, and it seems like data science and data engineering roles are just in huge demand. One question for you. I just love the story of you stocking shelves while being a ski bum. Did that influence sort of the way that you thought about solving problems around stocking and stock items when you were working on that from an actual data science standpoint at Shipt?
Ryan Boyer 03:52
I would say it certainly helps, right? Like I understood that Target doesn’t just get one truck a week, you know, there’s lots of trucks a week, they come at different times. And so there was some, like domain expertise I could bring to that problem. But I would say that the bigger thing that I learned, honestly, throughout all of my undergrad career, and especially through my time as being a poor ski bum, poor in the sense I wasn’t very good at it, was how central people are to data science, right? Like, I can build a great model, or potentially build a great model that predicts products being out of stock, but if it doesn’t actually impact or affect people, or enable people to prevent out of stocks or account for whatever causes them, then it’s not really useful. And so I would say that really is kind of like, the key driving thing for me as a data scientist is how can I make a model or systems that work for people and with people?
Eric Dodds 04:50
So interesting, can you give us just one example of sort of what it looks like to go from model to individual in some of the work that you’ve done? Like just a practical example for our listeners.
Ryan Boyer 05:04
Yeah. So I will say this is probably the hardest thing in data science, in my opinion: managing that stage of a project. So we can talk about out of stock. So we can get into the model more later, if you’re interested. Basically, at Shipt, I’ve built a model that predicts whether a product is out of stock in real time. You know, from a data science side, we get a score of between zero and one, basically a probability of it being out of stock. That’s great. And I can do metrics about how effective that is, and all kinds of things. But the question is, like, how can I use that to improve the lives of our shoppers, Shipt shoppers, or Shipt members who are either picking the groceries in the store or ordering them on our e-commerce app? And so there’s a lot of discussion about how to do that well and effectively and manage their expectations. I personally believe that data science models should be thought of as tools, and not solutions. Like no one looks at a house and then thanks the hammer, right? They thank the carpenter who used the hammer to build the house well, right. Like, I feel like data science is like, oh, man, deep learning neural networks, gradient boosted trees, like we can solve all of our problems with these cool tools. And I would say, No, you can build lots of great tools that can be really effective at your problem, but you still need to wield them effectively.
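To make the "tool, not solution" point concrete, here is a minimal sketch of the step Ryan describes: turning a raw out-of-stock probability into something a member actually experiences. This is not Shipt's implementation. The thresholds, the `Product` class, and the three outcome labels are all hypothetical, chosen only to illustrate that the decision layer sits on top of the model score.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
WARN_THRESHOLD = 0.5
HIDE_THRESHOLD = 0.9


@dataclass
class Product:
    name: str
    out_of_stock_score: float  # model output: probability in [0, 1]


def member_experience(product: Product) -> str:
    """Map a raw probability to a (hypothetical) in-app behavior."""
    score = product.out_of_stock_score
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be a probability between 0 and 1")
    if score >= HIDE_THRESHOLD:
        return "hide_from_search"
    if score >= WARN_THRESHOLD:
        return "ask_for_substitution"
    return "show_normally"


if __name__ == "__main__":
    print(member_experience(Product("goldfish crackers", 0.72)))
```

The model only produces the number. Where the thresholds sit, and what happens on each side of them, is a product decision about people's expectations rather than a modeling one.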
Kostas Pardalis 06:25
I think we see a pattern here we’ve seen before, right? Like, actually two patterns. One is like how you actually productize data science, which is not that obvious. And we will have the opportunity, I think, to discuss more on this today. And the other thing, Eric, that we have talked a lot about before was actually data science, machine learning and all that stuff, it’s more of a tool, and they augment and they have to work together with people. It’s not a replacement for the stuff that we are doing as people. So I’m pretty sure that you remember, like all these discussions we had, like in the past on other episodes around that.
Eric Dodds 07:03
Absolutely. Yeah, I’d say it’s been a huge theme with data science. Actually, we’ve had multiple data scientists on the show. And it’s been really encouraging that the most common perspective is that the human element of data science is the key determinant in whether data science is actually effective or successful.
Kostas Pardalis 07:26
I don’t think we’re going to see that Terminator at the end anytime soon.
Ryan Boyer 07:31
No, me neither. Yeah, I would also say the human element is important on both ends, right. Like getting data science to production, it really matters that you have a company and a culture who has bought in, willing to invest, willing to work with you, and willing to buy into the vision of how a data science tool can be used; so there’s like that front end part of data science being successful. And then there’s the part of just like, can you build a data science model that affects your business, or the people who use your business, in a way that supports and helps them, as opposed to just trying to abstract away all control and what they like about the business in order to give them best outcomes?
Eric Dodds 08:10
Sure, we had someone on the show who talked a lot about AI. And, you know, people sometimes can have a fear of AI, and blame the technology. And he said, you know, if you see negative results from that, you have to remember, there’s a human behind that, you know, whether it’s sort of building a model or approving it, which was very thought provoking.
Ryan Boyer 08:36
Yeah, I mean, I would also say, sometimes us, people who are building things can miss things too. And so there are sometimes mistakes, but I don’t think that’s an excuse for a data science professional to build something that is manipulative. But yeah, like, there are people building things. And in my opinion, will be building things for a long time.
Eric Dodds 08:58
Sure. Yeah. Well, I want to get into the technical side of things. And I know Kostas has a bunch of questions there. But I think as a segue, getting into that, one thing that would be really interesting to hear about is you joining Shipt pre-acquisition, the third person on the team with the title data science, and you’re going to have a huge team by the end of the year. I’d love to know what has changed significantly post-acquisition and what hasn’t changed that much.
Ryan Boyer 09:29
Yeah, so like you said, I was the third person with the title data scientist when I joined, our data science team was like five people and a manager. And we’ll be 50. We’re like 50ish people now; probably will be at 100 at the end of the year, if we can hire and find the talent we need. So feel free to apply if you’re interested.
Ryan Boyer 09:47
The main thing that I would say is that change is just constant at a company that is growing as fast as Shipt has been. And that is true for how we do data science. When I first joined, we were very scrappy and had little oversight. And it was kind of awesome. Because it’s just like, I wrote some code, and I’d be like, cool, do you like it? And you’d say, yeah, and I’d be like, okay. And we’d roll it out. You know, we’d deploy it, and we’d see what happens. And we’d learn from that.
Ryan Boyer 10:19
Of course, now, we are much bigger, much more at scale. And we have a much more rigorous system for deploys, and approvals, and checks, and understanding how it’s going to affect things. But there’s still this, in my opinion, this desire to learn through experimentation, to learn as much as we can and to go as fast as we can with a little more cautiousness as well. So I would say, like, a lot has changed, but a lot has stayed the same.
Eric Dodds 10:48
That’s really encouraging to hear that you still feel like the startup mindset and agility is still there, because that’s often something you hear people sort of bemoan post acquisition is, you know, we’re part of a big company now. And it feels like we’re part of a big company. But that’s really encouraging.
Ryan Boyer 11:07
Yeah, I would say it’s gotten harder to be as agile, like, you know, no one told me I couldn’t do anything back in the day, and I just did things. And now we have people who are, you know, trying to figure out what’s best and there is a desire to move the large ship in one direction. But I do feel that data science is something that you’re just going to fail at, a lot of times, like you’re going to build models that are not going to work. You’ll run statistical analysis, and you’re not going to find anything. And if you lose that ability to learn by doing, like not starting a project until everyone’s on board with it, or sometimes not even running a model in a small production test, until you feel very confident about how it’s going to behave like you’re going to have a lot of trouble moving quickly in the data science space.
Kostas Pardalis 11:52
This is great, Ryan. I want to go a little bit back in our conversation, and to the part where you were discussing AI, machine learning, and the impact that it has on our lives. And I want to ask you about something very specific, and this is bias. So I want to hear from you, first of all, help us understand a little bit better how bias is introduced, how it is, let’s say represented or like goes into like the models, and based on your experience, what kind of impact bias can have at the end to the end user of any model that like a data scientist can build?
Ryan Boyer 12:30
Yeah, so just to confirm, you’re talking about what I would call people bias as opposed to statistical bias, the mathematical term, correct? Okay. Just making sure. I would be surprised. And I’d be like, I don’t know, it’s been a long time since I thought about things at that statistical level. Bias, I believe, this is me as a person. We’ll go to me as a data scientist in a minute. Bias to me is just something that is innate to the human experience, right? Like, you don’t know what you don’t know. And it’s really hard to understand what you don’t know. And to me, a lot of the ways that bias enters a modeling process or an analytical process is through that unknown. You’re unaware that your sample of your data set only represents people from the southeast, or you’re unaware of something like that. And then in that process, you end up building a model that may be biased towards a certain member or customer type or segment of your business. Like, one of the classic ones you hear about in banking is like using zip codes, and zip codes end up being racially discriminatory, because if your model ends up not getting someone a loan because of their zip code, it can often be that the zip code is predominantly a certain race, and you end up having like, bias built into the model. So as a data scientist, our job is to, in my mind, identify representative data samples before you start building a model and account for that bias up front. And we’re never going to be perfect. Like that’s another thing that I feel like can be hard with data science models is like, they’re never going to be 100% accurate. But we need to make a best faith effort to control for bias in our data, control for bias in the features of our models, and ensure that we are building things that treat others as you want to be treated and are fair in their execution.
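One simple way to act on the "only represents people from the southeast" example is to compare the regional makeup of a training sample against the makeup of the population it is meant to represent. The sketch below is illustrative only: the region names, the population shares, and the 5-point tolerance are assumptions made up for the example, not figures from the conversation.

```python
from collections import Counter


def share_gaps(sample_regions, population_shares):
    """Return observed-minus-expected share for each region."""
    if not sample_regions:
        raise ValueError("sample is empty")
    counts = Counter(sample_regions)
    total = len(sample_regions)
    return {
        region: counts.get(region, 0) / total - expected
        for region, expected in population_shares.items()
    }


def underrepresented(gaps, tolerance=0.05):
    """Regions whose sample share falls short by more than `tolerance`."""
    return sorted(region for region, gap in gaps.items() if gap < -tolerance)


if __name__ == "__main__":
    # Hypothetical sample: heavily skewed toward one region.
    sample = ["southeast"] * 8 + ["midwest"] * 2
    # Hypothetical population shares.
    population = {"southeast": 0.4, "midwest": 0.3, "west": 0.3}
    print(underrepresented(share_gaps(sample, population)))
```

A check like this only catches the gaps you thought to measure. It does not address proxy features such as zip code, which can carry bias even when the sample itself is representative.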
Kostas Pardalis 14:39
That’s amazing, actually, it’s a very interesting and fascinating topic. And I think what is more important about this topic is like when we start talking about bias and how this can be introduced, because like there are humans behind these models, right? Like that’s the point where people can start to understand that, you know something, like at the end, it’s a human creation, and it reflects us. Right? So people at the same time, they are not that much aware about that when we’re talking about AI, like the public out there, we’re not talking about like the engineers or the data scientists, right? They think that it’s some kind of solution that gives, you know, the absolute truth, or it will always operate as we are used to with our cell phone, right, which is reliable, and all that stuff. So how can we communicate that to the public out there, and how can we, both as like data scientists and as product managers who are productizing these models, build like experiences that can educate, in a way, let’s say, the people out there to feel more comfortable with this new way of interacting with technology, which includes mistakes; it includes bias, right?
Ryan Boyer 15:52
I really like the word you used: educate. I really believe that for anything new to be successful, the people who are championing it have to also be educators of that domain. For data science to be used in new places at Shipt, I have to be an educator about what data science can do, because it is unknown to others. And sometimes to me, but that’s a different discussion. I really believe that the experiences that a model drives, or the interventions driven by a model, especially as they’re new, need to either have an education component to them, or be a gradual transition to that distant future or whatever. Like, the idea I always think of, and I’ve always heard when learning about this kind of idea, is like, you know, when they introduced elevators long before any of us were born, people were terrified to get on elevators, because this was a brand new idea of going up and down a machine and who knows what’s going to happen, and to assuage those fears, they said, elevator operators, little dudes are going to push the button to go up and down. And that will give you some comfort in this new system as we figure out how it’s going to work and we can educate you. And then of course, now, like, elevators are wholly complex, automated systems. I think data science is the same way, like, it’s always going to be a challenge to deploy the cutting edge in a way that is comfortable to people. What you can do, though, is make small steps towards that, and work to educate in the process of releasing new models and new experiences.
Kostas Pardalis 17:38
Yeah. What’s your feeling so far? Do you think we are doing a good job educating the people?
Ryan Boyer 17:45
You guys? Yeah, you’re doing great. I feel like there’s a lot of hype about data science. And I mean, I think being a data scientist is being a skeptic, like that’s one of the things that makes you successful. Like, is this data really saying what you’re telling me it’s saying? I think that there’s a lot of opportunity for data science to solve a lot of important problems in the world. I don’t think it’s this magic solution or the silver bullet. And, you know, like you said, we’re not gonna have Terminators walking around. Like, we kind of already have cyborgs. Right? You’ve got people with pacemakers, and all that kind of stuff, right? Like, and we think of that as normal. I think that there’s gonna be a gradual rollout of advances in technology, including data science, and it will come at a slow enough pace that we’ll only realize in hindsight, like, oh, yeah, cyborgs walk among us, these guys with pacemakers and transplants, and all that stuff to survive, like, I think we’ll feel the same way about data science in 10 or 20 years.
Eric Dodds 18:43
I think one of the challenges with data science, and we’ve talked about this before, in terms of sort of the public brand of data science and, you know, machine learning and artificial intelligence, is that when it’s done really well, the experience is simple and congruent for the user. And so it’s like, you want to think about it as like a Rube Goldberg machine, right? Like, you know, there’s an old movie called Chitty Chitty Bang Bang, and there’s this really complicated machine that literally just cracks an egg, and then puts it on a plate for breakfast. And it goes through this really complex process. But the result is simple, right? It’s just, you know, you have breakfast. And data science is the same way. And so it’s really hard for the average person to appreciate the complexity that goes into something that just means that their recommendations make really logical sense, you know, in an app or something.
Ryan Boyer 19:50
Yeah, I would add to that, I think that most data science models solve very simple problems too, right? It’s, you know, predicting whether this product is a good recommendation or not, or predicting whether a person will be a member of a subscription service or not in 30 days. The Rube Goldberg complexity part, like in the interaction, comes from how you use those, in my opinion. And when you start stacking models together and pairing them with email marketing, or ads, or recommendations, or changing how an app performs, that’s where the complexity comes in, to my mind. Like, obviously, there’s the complexity on the front end of cleaning your data and making sure it’s representative, avoiding biases, doing your due diligence to do data science well and ethically. But the complexity is so much more than the data science itself.
Eric Dodds 20:45
But speaking of complexity, one thing I wanted to ask you about is, and this really plays off of what we just talked about, so we use Shipt in our household. It’s a great service. We love it. And at a very high level, it’s so simple, right? You open an app, you choose the groceries that you want, and then someone delivers the groceries to you. It’s so nice and simple. But before the show, I was making a Shipt order. And I realized thinking about it through the lens of data science and sort of your role and thinking about the show, I realized this is so complex. I mean, there’s so many moving parts here. The app itself, I think, is very well designed, because there’s a ton going on, especially on the mobile side. You have to fit so many possible sorts of decisions into a small screen. But then I realized that’s just one side of it, right? I’m the consumer on the e-commerce side. And then there’s an entirely different experience for the person who’s picking the groceries and then delivering them. And so can you speak a little bit to the complexity, I’m sure there are things that I’m not even imagining, but it seems like a pretty wild set of data that you have coming in.
Ryan Boyer 22:02
Yeah, I’ll say up front that I think, what I call the datascape at Shipt, like I’ll never have this quality of problems, quality of data, volume of data, you know, just greenfield problems to solve anywhere else. It’s just so big and so vast and so complex. And it’s been great as a data scientist. I would also say you’ve identified two of the several sides of our multi-sided marketplace. We also partner with retailers to get their product inventory data into our app, and partner with CPG brands like Coca-Cola or Pepsi to get up to date nutrition information and sales data and like sales and coupons and stuff into our app. So we really have like four people kind of converging on this space of Shipt and all trying to make business exchanges, if that makes sense. It’s extremely complicated in that there’s just so many people assuming different priorities. And what we have to do to be successful as a business and as data scientists is to prioritize, like, fundamentally just prioritize what is the most important for us to do now. Because we certainly can’t do it all.
Eric Dodds 23:14
Absolutely. And what does that process look like? I mean, so you have these different data sources. And I’d love to know, sort of as a team, and I know, it’s beyond just the data science team, because you’re working with, you know, probably all sorts of other teams. But I’d just love for our listeners to hear what that process looks like? How do you prioritize the work that data science does, and what does that decision making process look like internally?
Ryan Boyer 23:39
Yeah, that’s the hardest part. Like, I mean, I can try to talk about how it is, but I think it’s constantly growing and changing, especially as we as a company grow and our business changes. You know, when I joined Shipt, we were in, like, 20 cities, and we just offered shop and deliver. So now we are in I think 45 to 48 states in the United States, we offer shop and deliver, we offer delivery only where a retailer makes the basket for you and our shopper just picks it up and drives it. We have four or five other kinds of business models. We deliver from places beyond grocery stores, Target, places like Party City, I think a couple sporting goods stores. I can’t even keep up with it.
Ryan Boyer 24:23
The business has changed so much that the main thing I would say about prioritization is that it’s not a one-time thing. It is a process, and it’s an ongoing process. And it can be painful to have something that you’ve worked on all of a sudden, like not being a priority anymore and to be shifting gears. But I think that’s necessary to be successful. In terms of how it actually happens, it’s getting a lot of people in a room together to hash it out and talk about it. And then at the end of the day, someone’s got to make a decision, and hopefully the group can collectively come to a consensus, but as we know, sometimes people disagree and it takes leadership to help guide the ship.
Kostas Pardalis 25:08
Ryan, you’ve mentioned that a lot of changes have happened in the company since you joined, because you also joined at a very early stage and the company also grew really, really fast. Can you tell us like a little bit how your work as a data scientist changed and how it was affected by this growth?
Ryan Boyer 25:29
Yeah, so I would say the first thing that changed is I now am focused on a much smaller section of the business and smaller problem scope. You know, back in the day, I did infrastructure things, I’d DBA a database or two, you know, I built dashboards, I did everything. I wore so many hats, and also worked with so many different components of the business: marketing, finance, accounting, engineering, operations, product. I was everywhere. As the company has grown, I don’t do as much with internal tools, and my focus has been much more on the operation side, or things that kind of take place across the operation side. So maybe, like we talked about out of stocks, right, that kind of spans the basket building on our member customer side and to the shopper side.
Ryan Boyer 26:19
So my scope has narrowed. And that’s been great, because I’ve been able to go much more in depth with these problems. The solutions we were providing, back in the day when I started, were all very simple. We tried to be pragmatic about it. Like there’s no use in spending two months extra to get a 5% improvement, right? Like, let’s get something simple. Let’s get it out there. I think we still embrace that ideal, but we just have much more opportunity to tackle harder problems. And so we get opportunities to invest in more complicated methodologies, more complicated problems, and hopefully bigger and more important solutions for our business.
Kostas Pardalis 27:01
So would you say that, let’s say the value of data science as an organization, as a function inside the company has shifted, compared to how it was at the beginning and how it is today, or is it just the scale that’s changed and like the structure of the organization?
Ryan Boyer 27:18
I think both. So Shipt really was relationally oriented in the early days, and we still are, to be fair. We still very much … kind of a core driver for Shipt is opportunity. We want to give people an opportunity to have more time with their family by not having to go to the store. We want to give our shoppers an opportunity to earn more income or supplemental income or a full-time job to provide for them and their family. Early in Shipt that was the core of our business and it was small enough that we could manage it in, I would say, a simple way. Like simple technology, simple rules, simple operations. Not that it was simple. It was very complex. But we didn’t need to rely on data science as much. As we’ve scaled and grown, data science and engineering have become so much more critical to the success of Shipt: being able to function at scale, being able to be efficient across the wide variety of businesses so that we can still be relational at our core, provide opportunity to people at our core.
Kostas Pardalis 28:30
Do you think there is a time that it’s too early for a company to invest in data science? Based on your experience?
Ryan Boyer 28:37
I would say no. But what I would say is that investing in data science often really means investing in data science foundations: data engineers, analytics, getting to a place where analytics are driving the business as opposed to reactively interpreting, like things the business attempts. That all is so important, and to me that is what investing in data science means for a younger company. That sets the stage for the fancy data science that we all think of when we say data science, advanced models, statistical analysis, that kind of stuff.
Kostas Pardalis 29:24
So from what I understand like in Shipt, data science is also like a big part of the product, right? Like there are features of the product that are actually driven by data science. And we will discuss more about this in a bit. But before we go there, are there other functions of the company right now that benefit from having a very strong data science team on their side?
Ryan Boyer 29:44
Yeah, absolutely. So features for members and shoppers is one thing you identified. Obviously, marketing and retention can benefit a lot from data science and just trying to understand who our customers are and how we can make them happy effectively. We have a lot of natural language processing data science problems at Shipt as well: all of the products that we get from our retail partners and from third party sources, and trying to enrich those. It’s really challenging to know if this package of Goldfish that Target sells is the same as this package of Goldfish that Winn-Dixie sells. Like, there’s a lot of natural language processing problems there of cleaning those up, identifying their brands, identifying if they are the same product across stores, and getting our data catalog in a way that is standardized across locations. There’s plenty of finance and accounting, modeling and forecasting components. And then I’d also say, there’s a big operations component. We have a marketplace, and we need to make sure that supply and demand are balanced. How do we hire shoppers? How do we match shoppers and orders? All those need to be done within the context of who we want to be as a business and how we want to value our shoppers and value our members. But data science plays a key role in all those things at Shipt.
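As a rough illustration of the catalog-matching problem (is this Goldfish the same as that Goldfish?), here is a minimal sketch that normalizes two product names into token sets and compares them with Jaccard similarity. Ryan does not describe the method Shipt uses; the technique, the example product strings, and the 0.6 threshold are all assumptions picked for the example, and a production system would likely also use brand, size, and barcode fields.

```python
import re


def tokens(name: str) -> set:
    """Lowercase a product name and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-set overlap: |intersection| / |union|, 0.0 if both are empty."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def likely_same_product(a: str, b: str, threshold: float = 0.6) -> bool:
    """Hypothetical decision rule: treat high overlap as a probable match."""
    return jaccard(a, b) >= threshold


if __name__ == "__main__":
    # Hypothetical listings from two different retailers.
    listing_a = "Goldfish Cheddar Crackers, 6.6 oz"
    listing_b = "Pepperidge Farm Goldfish Cheddar Crackers 6.6 oz"
    listing_c = "Goldfish Pretzel Crackers 30 oz"
    print(likely_same_product(listing_a, listing_b))
    print(likely_same_product(listing_a, listing_c))
```

Borderline scores are where a human review step would typically earn its keep, since a false match puts the wrong item in someone's basket.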
Kostas Pardalis 31:06
Yeah, it’s super interesting. I want to ask you another question. And to try and make Eric happy. So my question is, Ryan, can you give us a tip, or help us understand how data science can help marketing, especially in a way that it’s not that obvious to most of our people, people like me out there that we are not like actively working with marketing or data science.
Ryan Boyer 31:38
I will say that this is probably the area I’ve spent the least amount of time at Shipt focusing on. But a common way I’ve seen data science used across multiple companies is for subscription services, identifying likelihood of churn. So you could build a data science model that predicts if a subscriber of your service will still be there and still be a member in 30 days, or 90 days, or 15 days, whatever time interval you want. Coming out of that you can get an understanding of who you need to target for retention. And this can be as simple as reaching out to them on the phone and asking how they’re doing, like for a small, you know, SaaS company. Or for something like Shipt, this could be something like extending a discount, or, you know, giving them a $5 credit to try to get them to re-engage with a service. And those interventions obviously need to be domain specific. But if you can understand who is appreciating your service, and who is not appreciating your service, you can begin to try to figure out why and how you can fix that problem.
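The churn workflow described here has two parts: score each subscriber's likelihood of leaving within some window, then pick who to reach out to. A minimal sketch follows. The two features, the hand-picked weights, and the 0.5 cutoff are hypothetical stand-ins; a real model would learn its weights from historical data, and nothing here reflects how Shipt does it.

```python
import math

# Hypothetical, hand-picked weights; a real model would learn these from data.
WEIGHTS = {"days_since_last_order": 0.08, "orders_last_90_days": -0.30}
BIAS = -1.0


def churn_probability(features: dict) -> float:
    """Logistic score: estimated probability the member churns in the window."""
    z = BIAS + sum(weight * features[name] for name, weight in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def retention_targets(members: dict, threshold: float = 0.5) -> list:
    """Members at or above the threshold, highest risk first."""
    scored = [(member_id, churn_probability(f)) for member_id, f in members.items()]
    at_risk = [(member_id, p) for member_id, p in scored if p >= threshold]
    return sorted(at_risk, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical members.
    members = {
        "member_a": {"days_since_last_order": 45, "orders_last_90_days": 1},
        "member_b": {"days_since_last_order": 3, "orders_last_90_days": 12},
    }
    for member_id, p in retention_targets(members):
        print(member_id, round(p, 3))
```

What to do with the resulting list (a phone call, a discount, a credit) is the domain-specific intervention mentioned above, and it sits outside the model.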
Kostas Pardalis 32:47
What do you think, Eric? Is this something useful for marketing?
Eric Dodds 32:52
Yeah, and I will tell you, I’m very happy. Thank you, Kostas. You made my day. No, I think if you put yourself in the shoes of someone in marketing, harkening back to my previous life, I think the challenge you have at scale is that the analytics tools that you’re using aren’t built to predict churn with sort of, like, custom inputs, right. And you can’t really do it in a spreadsheet, because it’s way too much volume. And so you can sort of anecdotally look at individual customer journeys to try and give yourself an idea of what types of things might be causing churn. But it’s pretty hard tactically to achieve sort of a statistically significant view, as a marketer, you know, if you’re not really, really good at SQL, but even then, you know, you’re sort of at a large company, dealing with, you know, sort of access to databases and all that sort of stuff. So absolutely, especially at scale. I mean, I can’t imagine, you know, trying to crunch data at a company like Shipt because of how much there is. So yeah, I think there’s huge value in that. Ryan, one thing you mentioned, actually, and I’m interested in this from kind of a marketing and product standpoint, but you’d mentioned prior to the show a sort of cold start problem with search. And I’d love to hear the story around that tactically. I think our audience would love to hear about that. Could you talk about that particular problem and how you solved it?
Ryan Boyer 34:23
Yeah. So back in the day, the good old days, Shipt was small. And I think I was hired before we had an engineer who was responsible for search. You know, it was like, we had an engineering team, and they had solved search and some other problems, but we didn’t have a search engineer. And so the problem that we had was, every time we launched a new retail partner, we had a brand new catalog of data. No one had ever seen it before. No one had ever bought anything before. I mean, yes, people had bought Goldfish at prior retailers. But how should we show search results from a new catalog? How do we handle things like house brands, or you know, things unique regionally, those differences, when basically, we had nothing. I’ll start by saying that our search team has solved this better than I did early in Shipt’s lifetime. And so my solution has now been deprecated and laid to rest, and we are all better off for it. But what we did to solve this problem was basically build what I would call a human-in-the-loop tool that allowed us to use machine learning and then polish it at the very end to give us a great search experience on day one for our new retail partners.
Ryan Boyer 35:38
What we first did was we did some advanced natural language processing stuff to compare products from existing stores that we have sold to products from new stores that we had never shown, seen, or sold. Getting a little tangential here, but I’ll be upfront and say that UPCs are neither universal nor unique. So the idea of understanding exactly what product at store A is the same at store B is not as simple as you think it would be. And if you’ve ever looked at your receipt, and seen them, you know, taking all the vowels out of a product, and then giving you the price. Like sometimes our data comes in like that, or at least used to. So there’s this fundamental problem of identifying first, what new products are similar to old products, and then sorting them based on that similarity and inferring from the old product search rankings where the new products should be. So that’s a very high level of how we did it. Technically, I would say we built an in house KNN (k nearest neighbors) clustering model, and then inferred search results off the clusters. And then we had a suite of tools that allowed us to go ahead and, you know, people always buy bananas. Do bananas come up near the top of the list? What happens if you search for cheese? Does it look right? We were able to just manually clean it up.
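The approach described above can be sketched roughly as follows. This is an illustration, not Shipt’s implementation: character trigram Jaccard similarity stands in for the “advanced natural language processing”, the product names and popularity scores are made up, and the `pinned` argument is a hypothetical stand-in for the manual cleanup tools.

```python
def trigrams(name):
    """Character trigrams of a normalized product name."""
    text = " ".join(name.lower().split())
    padded = "  " + text + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def infer_scores(new_names, known_scores, k=2):
    """Give each new product the similarity-weighted average score
    of its k most similar already-known products."""
    known = [(trigrams(name), score) for name, score in known_scores.items()]
    inferred = {}
    for name in new_names:
        grams = trigrams(name)
        neighbours = sorted(
            ((jaccard(grams, g), score) for g, score in known), reverse=True
        )[:k]
        total = sum(sim for sim, _ in neighbours)
        inferred[name] = (
            sum(sim * score for sim, score in neighbours) / total if total else 0.0
        )
    return inferred

def rank_new_catalog(new_names, known_scores, pinned=()):
    """Rank by inferred score, with manually pinned items forced to the top."""
    scores = infer_scores(new_names, known_scores)
    rest = sorted(
        (n for n in new_names if n not in pinned),
        key=lambda n: scores[n],
        reverse=True,
    )
    return [n for n in pinned if n in new_names] + rest

# Hypothetical search popularity learned from existing retailers (0 to 1).
known_scores = {
    "Organic Bananas": 0.95,
    "Goldfish Cheddar Crackers": 0.80,
    "Banana Chips": 0.40,
    "Kitchen Sponge 3 pack": 0.20,
}
# Hypothetical catalog from a retailer with no sales history yet.
new_catalog = [
    "Store Brand Dish Sponge",
    "Bananas, Organic, 1 bunch",
    "Cheddar Goldfish Crackers Family Size",
]
print(rank_new_catalog(new_catalog, known_scores))
```

The model produces a day-one ordering without any sales history for the new retailer, and a person can then pin or reorder items by hand, which is the human-in-the-loop step.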
Eric Dodds 37:09
Sure. And did you see pretty significant improvements to people’s sort of initial search experiences?
Ryan Boyer 37:16
Yeah, so we definitely saw improved initial search experiences. It’s a challenging thing to test and measure because every new retailer is different. So there’s a lot of confounding variables. But we did see significant improvements in search conversion rates compared with what we had been doing beforehand, which was people manually identifying the top 1,000 items and sorting them at that point in time. That was old Shipt, like all the search rankings were just, you know, from top to bottom filtered in order; that was kind of how it worked. The other big benefit we had is that the time to market for solving the search problem was drastically cut down. It would take, you know, our catalog team, multiple people, four hours to eight hours to just initialize search for a new retailer. With the data science model, a data scientist was able to do it in, you know, an hour of active time plus whatever computational time it took, and get better results while saving a lot of man hours in the process.
Kostas Pardalis 38:22
That’s amazing, Ryan. Can you share with us a little bit more information about the data stack?
Ryan Boyer 38:27
We started from a very early Shipt with that last conversation, moving up to a more modern Shipt. The data stack has evolved over time. Today we use Snowflake as our data warehouse solution. We have Postgres databases, which may or may not be used by the data science team. And for BI tools, our data scientists use Tableau. We also use dbt a lot for data engineering purposes. But all of our actual model deployment processes, taking the out of stock model and running it in production so that it feeds real time systems, all of that is built in house. And we are building a team at Shipt right now to really build the next generation model deployment stuff, ML platform is kind of what we’re calling it, build the next generation ML platform for Shipt, because we’re still running on some of the stuff that we hacked together in a couple afternoons several years ago. There’s a lot of excitement there.
Kostas Pardalis 39:26
That’s super interesting. Actually, we had an episode a couple of weeks ago with Tecton. You probably know them …
Ryan Boyer 39:35
I met some of the guys at Tecton. They’re doing some really cool stuff over there.
Kostas Pardalis 39:39
Yeah. And what I found very interesting is that, actually, in this space that we call feature stores, which I mean, okay, I think still, there’s a lot of confusion about what these things are, but there’s not a lot of, let’s say, open source solutions out there. Actually, there’s only one which I think is called Feast. Any plans from your side to open source anything?
Ryan Boyer 39:59
I don’t know is the honest answer. Personally, I would love to open source things. I think that sounds fun and satisfying. I think that more than likely our team will be relying and building on top of a lot of the existing open source tools that are out there and then tweaking it for our needs.
Kostas Pardalis 40:17
Yeah, makes sense. Makes sense.
Ryan Boyer 40:19
I would say that ML platform is a huge competitive advantage these days in the data science space. After the people part of data science, the next hardest part is actually integrating it with all the things. Like how do you take that tool and use it effectively? How do you do it in real time? So that’s my understanding and expectation of why there’s not a lot of open source machine learning platforms out there, because to the people who have built it and done it well, like it helps them succeed and helps them outlast their competitors.
Kostas Pardalis 40:50
Yeah, I think that’s an excellent point. And I think it’s a very good explanation of why this is happening. And I think it explains why even really big companies that traditionally have a lot of open source presence, like Netflix, for example, even they haven’t made public the feature stores that they have built. You see a lot of talks about it, presentations, and all that stuff. But like, none of this is open source yet. And I think it’s an excellent point that you’re making. It’s like an actual competitive advantage that companies have by having these systems in house. So it makes total sense. So you shared with us, like the data stack that you have, are there any other specific tools that are used only by the data scientists? How do you build and how do you iterate on your models? Like are there any frameworks that you’re using for that, libraries, anything specific that you would like to share with us?
Ryan Boyer 41:48
So first thing I’ll say is, I think that we are evolving there and we’ll have a lot of new tools to better set up our model building systems down the road. We’ve looked into all kinds of things, from MLflow and Kubeflow for artifact storage and model iteration storage. Like, there’s a lot of opportunity out there and we’re trying to decide what we want to build in house, what we want to pay someone for, and what we want to use open source for. In terms of tools that we commonly use, Shipt data science is very ambidextrous, we use both R and Python as the problem needs.
Ryan Boyer 42:26
I will say that anytime we start getting into that real time space, where we start having to think about feature stores and think about APIs, Python’s going to win out there. But oftentimes, we find that R is much more helpful at that exploratory data analysis phase. And we do have internal packages for both R and Python that allow us to very easily communicate with all of our data stores and write and push data to them, as well as our cloud provider. In terms of like other tools and the process flow, I think what we really want to do is build our systems in a way that data science can iterate independently of the rest of the business. Obviously, that’s not okay for us to do in all cases. But like if I’m building a recommendation engine, the goal would be for me to build it in a way that it communicates consistently with engineering. And then I can begin iterating on it however I want to improve it, working with product managers and the business so they’re aware of the changes, but apart from the current systems. So we really embrace that kind of microservice idea. Like that’d be the engineering component of it. Like microservices at Shipt, and really strive to build it simple at first, and then iterate and learn as we launch and run.
Kostas Pardalis 43:48
This is great. Last question from my side. And then I’ll let Eric ask any questions he might have. What’s the relationship with data engineering and how do you work together with them? And how is the function defined inside Shipt?
Ryan Boyer 44:02
Yeah, so I would say there’s actually kind of two data engineering groups at Shipt. One data engineering group at Shipt is all about getting data that our partners provide and getting it into our system so that we can sell the products they have. And that is a lot of data. And historically we’ve worked very closely with that group just in terms of building solutions to clean, standardize, and understand the product data that’s coming in from our partners: predicting what brand a product is if it doesn’t come tagged with a brand, identifying two products that are the same, identifying if this picture is correct, that kind of stuff.
Ryan Boyer 44:41
The other data engineering group is all about building methodologies for our engineering microservices, and their data to be stored in our data warehouse and transforming that data into things that can be used by data scientists for analytics or by others in the company for analytics. We work closely with them, though I think that that team is going to continue to scale even more as our group is growing. A lot of the challenge that we have at Shipt with data and growing fast is that things change, and it’s hard to know when they change. Like, if engineering changes the way they’re solving a certain problem in the business, it can be challenging for us to know that that happened way downstream. Our data engineering team is crucial for handling those changes and ensuring that we get clear and ready data. And they’re, they’re a bunch of great guys. I love them a lot.
Eric Dodds 45:44
Ryan, one question on the data engineering side, going back to sort of the early days before maybe the data engineering team was as big as it is today, did the data science team actually do some of the data engineering work as well? Or has there always been a sort of clear delineation of responsibility?
Ryan Boyer 46:03
There has not. And in a lot of ways, there still isn’t. Like the data I need to build an out of stock model, right, it’s not going to be present in like this perfect form where I can just select star from my table, you know, and roll with it. There’s still a lot of data engineering that I have to do, that we have to do. Data scientists build our models and build the pipelines that feed and serve those models and that run the analytics from those models back into a place where they can be analyzed later. But I will say that the demarcation is much cleaner today than it was in the past. Very early on, I did a lot of data engineering. And that was just a necessary thing for data science to work and function at that time, because there weren’t as many dedicated resources to internal data engineering.
Eric Dodds 46:50
Sure. Super interesting. Well, we’re close to time here. So we’ll ask one more question before we hop off, what are you most excited about in terms of trends in data science that you kind of see on the front lines doing the work every day?
Ryan Boyer 47:06
Yeah, so that’s actually a really good question and a really hard question. One thing that I really am excited about, and this is broad industry right now, I feel like some of the hype is dying down. There was this idea that data science is going to solve all of our problems, self-driving cars are going to be here, and they’re not and all our problems haven’t been solved yet. And so we’re at a point where we’re coming to kind of terms with what data science can do. And we’re beginning to really, as an industry, begin to make tangible steps forward as opposed to having to dance around the hype and expectations. So that’s one thing that really excites me. I think the other thing is that people are becoming more and more receptive to data science being used in effective ways. And we are really learning as an industry how to do data science effectively, like you all talked about this earlier. Lots of people have come in and talked about the human element of data science and how important that is. And as an industry, we’re really starting to realize that and best practices are being developed. And a whole bunch of companies have popped up to provide services for MLOps and how we can do data science at scale and monitor data science at scale. Like, we’re coming out of the kind of Wild West early days of data hype into a more steady and stable industry with some more best practices around. That’s not to say that we are stable, and there’s not a Wild West component of this, but I feel like it’s much more clear how to solve a lot of common problems today than it was when I started. I’ve also learned a lot in that time too though.
Eric Dodds 48:51
Well, even I mean, it’s interesting, even if you think about when you started at Shipt, and you know, to today, the number of new tools even that have been introduced that make a lot of these things easy, or tools that have been developed internally, you sort of have this maturing of the discipline where some of the technical problems are getting out of the way. And you can focus on the deeper problems that you’re actually trying to solve as opposed to building the infrastructure that makes it easier for you to solve them.
Ryan Boyer 49:23
Absolutely. Yeah. And I mean, an example of how this changed for us: early on at Shipt, we used Airflow, the batch job orchestration tool. It was pretty much the only option at that point in time. We’ve revisited that. We’re still using Airflow, but we’ve revisited it and had discussions about whether that’s right for us these days. And there are now six or ten options, and plenty more that I don’t know about, that each meet that same general need but do it in slightly different ways, or for slightly different targets, or slightly different niches. And from that it’s just wonderful to have opportunities and choices to figure out what you want to do with your business and to lean on the expertise of others.
Eric Dodds 50:05
Absolutely. Well, Ryan, it has been such a pleasure to have you on the show. So interesting to hear about everything that’s going on at Shipt. Congratulations on the success. And best of luck in hiring, doubling the size of the team before the end of the year. That’s a tall order.
Ryan Boyer 50:22
Yeah, hiring is hard. I’m excited for it, though. We need all the help we can get.
Eric Dodds 50:28
Cool. Well, we’d love to check back in with you on a future episode. And thanks again.
Ryan Boyer 50:33
Thank you so much. I really appreciate it, Eric. Thanks, Kostas.
Kostas Pardalis 50:36
Thank you, Ryan. It was great.
Eric Dodds 50:38
As always a fascinating conversation. I think one of my big takeaways was hearing about how things have stayed the same in many ways, going through a huge acquisition by such a large company like Target. That was just really cool to hear. I mean, obviously, there’s more sort of structure being part of a larger company. But it’s really neat to hear. A lot of times you’ll hear the opposite story where a company gets acquired, and sort of your ability to be agile early on dissolves. And it’s not as gratifying to be part of the team anymore. But I didn’t get that sense at all. But it just makes me really happy to hear that that was sort of managed well, and that they can still have that startup type feel to some extent at a big company.
Kostas Pardalis 51:22
Yeah, I’m a little bit disappointed. To be honest, Eric, we have another data scientist who said that we’re not going to see Terminator anytime soon.
Eric Dodds 51:33
It’s a long time until 2030…
Kostas Pardalis 51:39
Regardless of that, it was a great conversation with Ryan. And I think it’s amazing to hear from people about what kind of impact data science can have in a company and how many different aspects of the company it can affect. And I think at Shipt, from what I understand from the conversation we had, you have internal users, you have the product itself, and pretty much every stakeholder around the company is affected by data science. And I hope that we will have more and more opportunities in the future to communicate and educate the people out there about how data science is an important part of any tech company today. And not only tech, actually any company.
Eric Dodds 52:24
And I think one theme that’s been recurring is the human element of data science, which I think has been really interesting to hear about. And Ryan brought that up without us even bringing it up. And that’s just been a constant theme with all of our guests. Which is, I think, both fascinating and encouraging.
Kostas Pardalis 52:41
Eric Dodds 52:42
All right. Well, until next time, thank you for joining us on The Data Stack Show. Make sure to subscribe on your favorite podcast app. You’ll get notified of new episodes every week, and we’ll catch you in the next one.
Eric Dodds 52:55
The Data Stack Show is brought to you by RudderStack, the complete customer data pipeline solution. Learn more at RudderStack.com.