This week on The Data Stack Show, Kostas and Eric are joined by Arian Osman, a senior data scientist at Homesnap who is also nearing the end of his PhD in computational sciences and informatics and is the founder of an e-commerce clothing brand. Homesnap is designed for both homebuyers and agents to access data from the MLS (Multiple Listing Service), providing real-time, accurate information to all parties involved.
Highlights from this week’s episode include:
- Arian’s background and an overview of Homesnap (2:30)
- Utilizing data in Arian’s e-commerce clothing brand (7:14)
- Homesnap’s sell speed feature and visualizing outputs (13:28)
- The psychology that drives upper and lower limits (19:33)
- Deciding the life-cycle of a model (25:50)
- Collaborating with internal stakeholders (30:47)
- Unique challenges of data in the real estate domain (38:16)
- Useful third-party tools (43:33)
The Data Stack Show is a weekly podcast powered by RudderStack. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Eric Dodds 00:06
Welcome to The Data Stack Show. We have a really interesting guest, another PhD, or PhD candidate, he's really close: Arian Osman from Homesnap. Because my mother has been a real estate agent for a long time and Homesnap's in the real estate space, I'm interested to ask some questions about the data there, just because I know it can be pretty messy. But Arian also has a pretty interesting background in consulting and as an entrepreneur, so I'm just excited to hear about his experience. Kostas, what are you going to ask him?
Kostas Pardalis 00:39
I think it's going to be interesting. Also, I think it's not the first time that we have someone who's working in the real estate space, so I have a feeling that the space is going through a lot of digital transformation. There's a lot of work to be done with data in this space. I think it's quite interesting. Also, it's very interesting that we are going to have another data scientist. As we said in previous episodes, we used to focus more on data engineers, but it's super, super interesting to hear also the work and the point of view of data scientists inside the company. And I think we will see that there is actually an overlap between all these roles, and this is going to be quite interesting to observe as we continue. My questions, I think, will be more around the intersection between data science and product: how we productize these models that data scientists create, and what it takes, from an organizational point of view, to structure a company to create these kinds of products. I think, from a product management and product strategy perspective, there are many challenges ahead. I mean, we've made a lot of progress figuring out how to build software products, and now we are moving more towards data products, let's say, and there are new things to learn and explore in terms of how to build these products.
Eric Dodds 01:59
Great. Well, let’s dive in and talk with Arian.
Kostas Pardalis 02:02
Let’s do it.
Eric Dodds 02:03
Welcome back to The Data Stack Show. We have Arian Osman from Homesnap on today, although we're going to hear about a lot of the other interesting things you're up to, Arian. Thank you for joining us.
Arian Osman 02:15
Thank you for having me.
Eric Dodds 02:17
All right, getting started, could you just tell us a little bit about your background, which I want to dig into a little bit more. And then just a quick overview of Homesnap and what the company does in the real estate space.
Arian Osman 02:30
Sounds great. So with respect to … I'm currently the senior data scientist at Homesnap, one of the leads. There is another lead as well, who's more focused on, you know, the out-of-the-box solutions, and I lead both on the theoretical side as well as custom implementations for Homesnap. I've previously worked as a software engineer, database administrator, database developer, as well as creating data pipelines. So anything data-related I've been involved in, including QA. So when you see the whole software development lifecycle, I've kind of been involved in each facet. And one of my skill sets is that, you know, I'm able to bring that strategy to the data science team at Homesnap, which is one of the reasons why they hired me. So having that diverse background and being able to acclimate to utilizing multiple technologies has been a forte of mine for the longest time. So when it comes to adapting to each previous organization, I was able to do so successfully. My background, education-wise, is in mathematics. I'm currently finishing up my PhD in computational sciences and informatics at George Mason University in Fairfax, Virginia, and hoping to finish that by the summer. And it's kind of related to, well, it is related to what I do at Homesnap, but in a different sort of realm. But there's quite a bit of overlap.
Eric Dodds 04:01
Awesome. And could you give us just a quick overview of Homesnap, I know a lot of our listeners are probably familiar with it if they’ve shopped for a home, but would love to just hear about what the company does and the problem that you solve.
Arian Osman 04:14
Sure. Homesnap is, like you said, a real estate application. And it's utilized and used by consumers, but we focus on the agent experience as well. That's what makes us different from other apps. As you may know, or may not know, MLS data is very hard to access. So from the business side of things, we create relationships with those MLSs to obtain real estate data, which gives us that additional edge. Alongside that, we utilize public information, census information, and anything that we can find with respect to our needs as data scientists. Yeah, so, you know, agents can make listings, they can, you know, create their own custom sites within the application. You know, we have some great products and subscription products that they can subscribe to, with some benefits depending on what level they are. So we essentially create a great product for them. And my role is twofold at Homesnap: one, to implement artificial intelligence into the application, but also to help the company gain insights internally. So, you know, it kind of goes on both sides of the spectrum there when it comes to what's internal and what's external.
Eric Dodds 05:44
Very cool. Well, my mom is actually a real estate agent, and she's asked me a couple times why the MLS service she uses isn't working correctly. So I've actually seen the software, and it doesn't surprise me at all that the data is a mess. So I want to hear about that. But I know Kostas probably has a ton of questions, so I'm going to take the first one here, and it's a little bit more about your background. Talking before the call, we touched on multiple different things. You've had a background in consulting in sort of data-related work, so databases, data pipelines, a wide variety of tooling, which I think is really interesting; you are working on your PhD; and you have an e-commerce brand, which is a really interesting combination of experiences, both sort of on the theory side and the practical side. You know, you think about the academic pursuit in studying mathematics, and then what we would say is sort of the bleeding edge of data, which is e-commerce, right? That's, you know, one of the most interesting spaces as far as data in real time and all that sort of stuff. So I'd love to know, as you think about those different experiences, what are trends that you've seen? I mean, you've ingested a lot of different experiences. Any major things that stick out across those different verticals of experience?
Arian Osman 07:14
Well, what I could tell you flat out from a data science standpoint is that tools and products for consuming data science implementations are constantly evolving, whether it, you know, be on AWS and, you know, other platforms as well. I mean, they're constantly improving and evolving and getting better. That's primarily what I've noticed. Let's say, if we go to SQL Server, for example: in 2016 and 2017, they integrated the utilization and consumption of R scripts and Python scripts. So a lot of existing and older technologies are trying to implement what is currently popular in the data science world. So that's kind of the experience I've been seeing when it comes to relational databases and software and third-party tools. I mean, the third-party tools are just getting better, smarter, faster, and trying to make the lives of the consumers easier, which is great. Me being trained in mathematics and computational sciences, I've created more customized tools. So for example, you have, you know, AWS implementations that utilize deep learning. I actually know the theory behind it, as well as writing the custom code necessary to develop an implementation like that, instead of using, you know, libraries and what have you. So that's kind of where my skill set is. And if there is something out there, one of the benefits of working at Homesnap as a data scientist is that we're able to research and get the information that we need to determine whether or not a custom solution is needed, or we could utilize a third-party tool. With respect to my e-commerce brand, you know, it's a clothing brand and based on my PhD research, actually, so I can't go into too much detail on that until my dissertation is published. But what I can tell you is that my background is primarily in image processing, and I extract and detect features in fingerprints.
I essentially applied that skill set to the male form and utilized that knowledge on images of the male form, where I would construct what would be considered optimal looks and feels and cuts and fabrics for a particular line, depending on what you want to show. So for example, you know, I'm a shorter guy. I'm five-seven. And the thing is, if you wear longer shorts, you're gonna look even shorter. So if you wear shorter shorts, you'll look taller. So stuff like that I take into consideration. So currently, my brand is based off of my current stature. But as I build the brand further, you know, I've had models that are six-two, and, you know, I actually have a project happening on Sunday where the model is six-six. So, you know, it's essentially a trial and error thing where they review, where they give their input. And having that input, and, if it exists, the constructive criticism, helps me to improve my implementations.
Eric Dodds 10:47
Very interesting. Yeah, I've talked about that with my wife. You know, I have really long arms and kind of a long torso, right? And so, even though something may be the right size on paper, she's like, it looks weird. I'm not great at picking my own clothes, right? She's like, it looks weird on you, even though it's the right size, just because proportionally you're different. And it's fascinating to think about that from the standpoint of data science, and actually using math to sort some of those things out.
Arian Osman 11:19
Yeah, you know, when you're dealing with classification problems in the data science world, you can think of it as, you know, a straight line separating one class from another class, you know, determining what belongs on one side and what belongs on the other side. I'll throw this term out: you're dealing with multiple dimensions when it comes to that. So, you know, you talked about having longer arms, I'm not sure how big your arms are, you know, you're adding certain layers. And the key thing that, you know, has to be communicated is that data science models are not going to be perfect. We can get them close to perfect to a certain degree, but it also depends on the dimensionality of the problem.
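[The "straight line separating one class from another" that Arian mentions can be sketched in a few lines of Python. This is a minimal editorial illustration, not Homesnap's code; the weights and sample points are invented, and a real model would learn them from data.]

```python
# A toy linear classifier: a point is assigned to class 1 if it falls
# on the positive side of the line w·x + b = 0, else class 0.
def classify(point, weights, bias):
    """Classify a feature vector by which side of the hyperplane it is on."""
    score = sum(w * x for w, x in zip(weights, point)) + bias
    return 1 if score > 0 else 0

# In two dimensions the "hyperplane" is just a line; with more features
# (arm length, torso length, ...) the same rule applies in more dimensions.
weights = [1.0, -1.0]   # hypothetical learned weights
bias = 0.0

print(classify([3.0, 1.0], weights, bias))  # one side of the line -> 1
print(classify([1.0, 3.0], weights, bias))  # the other side -> 0
```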
Kostas Pardalis 12:01
That's super interesting, Arian. And I want to ask you something that's related a little bit more to the product perspective of what you're describing. And I think it becomes more and more obvious as machine learning and machine intelligence penetrate more of our everyday life. So for example, let's say I go to a shop to buy clothes, right? There is a person there who is going to guide me, who's going to help me. If I'm not feeling sure about something, the person is going to give me some advice. Let's say it acts in a way like the machine intelligence that you are trying to build, of course in a different way. But at the end, the kind of feedback that I will get from that person is going to be part of the system that you're building. Now, one of the problems, from my understanding at least, with machine learning and machine intelligence in general is that it's a bit of a black box, right? It gets some input, like, for example, have Eric as the input, the size of his body, his proportions, and all that stuff. And the output can be something like possible clothes, together with some numbers that might indicate that something is a better fit for him. Let's say that we do that, right, to propose to him to get a specific piece compared to someone else. How do we explain that to that person? And how important do you think it is to explain that?
Arian Osman 13:28
Absolutely. And that's a great question. And, you know, when I talk about classification problems, you know, there's binary classification, whether it's this way or that way, but what you also have to understand is that there are probabilities associated with it, possibly. So depending on how certain items are computed in this black box, I will not go into too much detail on the black box, but just know that a possible output of the black box could be probabilities: whether this would be a 70% chance of a better fit versus this 30% chance fit, you know, something along those lines. You can have those numbers represented in that way. And it kind of relates to, and this particular problem was new to me, and I actually implemented that kind of problem at Homesnap. So for example, if you Google Homesnap and type in the words "sell speed", that was one of the tools that I created that was integrated into the application. And the problem that we want to solve, and mind you, this problem does have multiple dimensions, is how fast will a property sell? Will it sell within two weeks? Will it sell between two and four weeks? Between four and eight weeks? Between eight and 12 weeks? Or will it not sell at all, and that's anything exceeding 12 weeks? How that tool works is that we actually output those probabilities. So there's a slider where the input is the price of the home, but also the square footage of the home, and all those different dimensions are an additional factor. So as you slide the price, you know, you would assume that the lower the price goes, the more likely a property is going to sell, and if you slide it higher, the more likely it won't sell. And sometimes that's not necessarily the case, but you have probabilities.
So just because something says it’s gonna sell 75%, within 14 days, 25% is going to be greater than 14 days and how that’s distributed is important as well. Even if you have 99%, there’s still that 1% that does matter. So that’s kind of how you compute things. And you kind of add that human intuition whether or not you would go with this number or not, but you have to understand that you have that 1% chance, just like winning the lottery.
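[The multi-class probability output Arian describes can be sketched as scoring each time-to-sell bucket and normalizing the scores with a softmax so they sum to 1. This is an editorial illustration only: the bucket names follow the episode, but the raw scores are invented, and Homesnap's actual model internals are not public.]

```python
import math

# The five Sell Speed buckets from the episode.
BUCKETS = ["<2 weeks", "2-4 weeks", "4-8 weeks", "8-12 weeks", ">12 weeks"]

def softmax(scores):
    """Turn raw model scores into a probability distribution over buckets."""
    m = max(scores)                          # subtract max for numeric stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

raw_scores = [2.1, 1.3, 0.4, -0.5, -1.0]     # hypothetical model output
probs = softmax(raw_scores)

for bucket, p in zip(BUCKETS, probs):
    print(f"{bucket}: {p:.0%}")
```

Even when one bucket dominates, the remaining probability mass matters, which is Arian's point about the 1% that "does matter".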
Kostas Pardalis 16:13
Yeah, absolutely. I totally understand what you’re describing. Based on your experience at Homesnap, and how people are using this tool. How intuitive do you think it is for people to work with these concepts of probabilities, and their distribution and what that means when it comes to something also quite important, right, which might be like, the house they are selling or buying?
Arian Osman 16:35
Well, this is where we work with the product team and subject matter experts. So as data scientists, you know, there are so many ways that we can present the output visually, and, you know, me working in different facets and in different positions has allowed me to, you know, deal with different audiences and learn how to present certain things. So one of the things that you have to take into consideration is, one, of course, what the product team wants. You know, they're the gatekeepers of what it should look like at the end. But as well, you know, you talk with other C-level persons in the organization, you talk with your managers, you talk with subject matter experts, you talk with product, and we all come to an agreement as to how this data should be presented. And of course, working as data scientists, we build the models, and we build the proofs of concept, and we present them. And we ask them, 'Do you think, you know, the consumers of this tool will understand it?' And, frankly, it's a bar graph. You know, the sell speed implementation is a bar graph, and you can't get much simpler than a bar graph. You don't need to know about probability distributions, you don't need to know about, well, anything that, if I was on the other side of the aisle, I would call gobbledygook. Nobody really wants to get into that much depth when it comes to it. But they respond to visualizations that they can understand. And I think that's a very, very important thing that should be taken into consideration when you're actually building the models and presenting the outputs.
Eric Dodds 18:22
Arian, quick, one quick question on the models before we leave that topic. And I may not be asking this exactly the right way. But there’s some level of psychology that creates limits, if you think about pricing, and I’m not a pricing psychology expert, but let’s just say for easy math, you have a home, that’s, you know, sort of the fair market value might be $100,000. And so if you slide this slider down, you know, to like $85,000, theoretically, this, you know, the time to sell will be much faster, because that’s a really good deal. But there’s also a certain point at which let’s say you slide this slider down to $50,000. I mean, mathematically, you would say, well, sure, that’s like the bargain of a century, someone’s going to snap that up. But when you look at that, as a consumer, your instant reaction is there’s something wrong with the house, right? I mean, you look at the neighborhood, you look at comps, and you just sort of, even if you’re not a real estate expert, you just sort of intuitively know there’s a problem, right? There’s mold, or there’s some sort of issue. How do you think about the sort of psychology that drives like upper and lower limits?
Arian Osman 19:33
Yes. And, you know, I love that you brought up that point, because that's actually something I've had to explain. You know, we only have so much data, and there's the dimensionality of the problem. If we had the data, we could, or if we somehow incorporated that psychology in the model, which probably we could do, that could be like another added layer or something. That hasn't been done yet, but it's actually something that we've thought about, and we're continuously improving the sell speed model. But if you price the home to a certain level, you know, there is that limit boundary there where it will sell, you know, definitely within 14 days, but if you move the slider even further, the graph does move back to won't sell. And that's kind of where that intuition comes into play. What is wrong with this house? Why is it not selling? Could it be the location? Could it be how old the house is? You know, it could be so many different things; maybe it's in a flooding zone. There are so many different factors that can be incorporated in that. And based on what we've come across, this has been a test case of ours, believe it or not, so I'm just fascinated that you brought that up. That's great. And that's a question that we've been answering, and I've had to explain, yeah, multiple times. Psychology from a real estate standpoint might be a good research paper on the theoretical side of things, definitely, and maybe a potential future project. Who knows? But I know that, you know, from a theoretical side of things, that's definitely something that could be taken into consideration, or, you know, we add more dimensions to the model. I mean, it just depends how many dimensions you want. But also, if there is some way to combine and reduce the dimensionality of the problem to make it simpler, that can be done as well.
So standard data science practices can be implemented to answer such a, quote unquote, psychological problem, as applicable to real estate.
Eric Dodds 21:51
Fascinating. Absolutely fascinating.
Kostas Pardalis 21:53
So Arian, I have a question, actually, from the beginning, when you started describing your role at Homesnap. You said that you're actually working together with another person, and you're more focused on tools that are built in-house, and the other person is focusing more on out-of-the-box tools. Can you help us understand a little bit better what the difference is, and why this difference exists? What does it mean, in data science, that a tool is out-of-the-box? And what does it mean that we need to build something in-house?
Arian Osman 22:23
Right. Well, first of all, it all relates to the problem that's in question. With any third-party tool that you use, one thing that you do have to take into consideration is that certain tools have certain limits. And if it's a new tool, if it's a tool that's being continuously improved upon, that's something that we actually have to test out. For example, we were testing out something with SageMaker, and the concept of inference pipelines is fairly new. So when it comes to training a model, if we can create this single model that would deal with the problem that we were having, which is a multi-class problem: how does normalization of the data work? How does the training of the data work? How does incremental training of the data work? You know, so on and so forth. These are all the things that you have to take into consideration. Now, with respect to Homesnap, when it comes to how we do things currently, you know, as data scientists, we like to get things out. Of course, like in any organization, in any IT department or software engineering department, we like to get things out. And not everything will be perfect, but we have to determine what the sufficient threshold is as to what constitutes a good model. And does a third-party tool do what we want to do and meet that threshold? Do we have to build something custom that meets that threshold? So it's all about the problem that you're trying to solve. You know, the metrics you judge the model on, if you want specificity, recall, and, you know, precision and all that stuff, it depends on the problem that you're trying to answer. And third-party tools react differently to whatever the problem may be. So it is our job as data scientists, because nobody knows … and this, to be honest, nobody off the top of their head would know whether a third-party tool is better than a custom implementation. So we test things out.
We use libraries, we test different versions, and different implementations of the model. And then we choose what is easiest from an infrastructure standpoint. So just to keep things moving and progressing and reporting and communicating constantly with all the other teams within the organization is very, very key, as well as identifying what the pros and cons are for each functionality we implement.
Kostas Pardalis 24:59
Very interesting. I have a question about something that has always confused me about how it works in data science, because the lifecycle of a feature or a piece of software is pretty well defined in my mind: how you decide to update it or change something, or even decommission it. How does it work with models? So you train a model today; let's forget about the details of how you operationalize this model and productize it. What's the lifecycle of a model? I assume, and correct me if I'm wrong, that just because you build a model today doesn't mean that this model is going to be valid forever; things change, and it probably has to change. How do you decide that something has to change? How do you measure that? How do you measure the performance of the model? And how do you decide when to update it?
Arian Osman 25:50
When it comes to certain implementations, it's something that we actually do have to keep an eye on. It's a standard in data science that you find a time interval as to when you need to revisit models again, because there's that sliding factor, you know, that can cause more errors to occur. So you have to keep things up to date. So when I mention incremental retraining and validation: there's a train set, a validation set, and a test set, and you would use the most up-to-date data possible. So you're adding data, you may be removing some data; you have to be able to find that sweet spot that would allow you to optimize the performance of your model. So for example, you know, in sell speed, I'm only looking at the last two years of data. But as a new month ends, you know, I keep adding that previous month, while still keeping that two-year-wide time interval, only because, for example, the prices in 2007 significantly differ from what the prices of homes may be now. So it's those sorts of things that you actually have to look at in order to determine the threshold as to how much data you need, because more data is not necessarily always the best-case scenario. It depends on the strategy as to how you are using the data and how you can identify trends when it comes to seasonal behavior, monthly behavior, even, you know, the time of the month. I mean, for example, you would think that more people buy homes in March or April or May, as opposed to November and December. You know, that's a simple case. So in order to identify all these potential problems, and in order to minimize that sort of loss in accuracy, you have to answer problems like that.
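[The sliding two-year window Arian describes, drop sales older than two years, add the month that just ended, can be sketched as a simple filter over listing records. This is an editorial sketch under assumptions: the record layout and dates are invented, and `refresh_window` stands in for whatever data-preparation step the team actually runs before retraining.]

```python
from datetime import date

WINDOW_MONTHS = 24  # keep a trailing two-year window of sales

def months_between(start, end):
    """Whole months elapsed from start to end."""
    return (end.year - start.year) * 12 + (end.month - start.month)

def refresh_window(listings, today):
    """Keep only listings sold within the trailing two-year window."""
    return [l for l in listings if months_between(l["sold"], today) < WINDOW_MONTHS]

# Hypothetical sold-listing records: sale date plus a feature.
listings = [
    {"sold": date(2019, 1, 15), "price": 300_000},   # older than 24 months, dropped
    {"sold": date(2020, 11, 3), "price": 410_000},   # inside the window
    {"sold": date(2021, 2, 20), "price": 385_000},   # inside the window
]

window = refresh_window(listings, today=date(2021, 3, 1))
print(len(window))  # the 2019 sale falls outside the window
```

Each month the same filter is re-run with the new `today`, so the training set rolls forward without ever growing beyond the chosen interval.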
Kostas Pardalis 28:08
It’s very interesting. And that’s part of understanding the domain for which the model is operating and getting inspiration from there to model this domain and figure out also when to do updates and all that stuff. But is there also like feedback that is coming from the product side of things?
Kostas Pardalis 28:26
Because from what I understand, based on what you said, based on the domain of the problem, we know that seasonal knowledge is important, for example, right? And based on the seasonality, we update our model. But how does this work with feedback that comes from the product? And when I say the product, I mean the product as a proxy for the customer, right? How does that actually affect the training and the models that you build?
Arian Osman 28:49
Right. So first of all, we look at the previous month, because, you know, we cannot predict the future. Of course, we'd like to, but unfortunately, we cannot. So we actually do keep an eye on our models, whether it be a manual process or whether it be an automated process. So we try to create either … we're looking at third-party tools or whatever. But, you know, me having the experience that I have, I have little scripts that I save when it comes to looking at various losses and the accuracy of the model, and, you know, keep on testing, so on and so forth. But since, in this domain, and like you said, it changes based on what the domain is, a monthly sort of increment and a monthly sort of check can be done. And that could be by creating a random validation set from the previous month, because you actually know what the results should be. So you have the trained model, but you just create a new validation set, you know, to test on it, to see how it's progressing and whether it's significantly higher or lower. Then you actually have to do the analysis as to why it's happening and resolve it.
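[The monthly health check Arian describes, score the model on a validation set built from the month that just closed, where the true outcomes are now known, and investigate if accuracy drifts, can be sketched as below. An editorial illustration only: the bucket labels, numbers, and the 5-point tolerance are invented, not Homesnap's actual thresholds.]

```python
def accuracy(predictions, actuals):
    """Fraction of predictions that match the known outcomes."""
    correct = sum(1 for p, a in zip(predictions, actuals) if p == a)
    return correct / len(actuals)

def needs_retraining(predictions, actuals, baseline, tolerance=0.05):
    """Flag the model if last month's accuracy drifted beyond tolerance
    from the accuracy measured at training time."""
    return abs(accuracy(predictions, actuals) - baseline) > tolerance

# Last month's predicted sell-speed buckets vs. what actually happened
# (hypothetical data).
preds   = ["<2w", "2-4w", "<2w", ">12w", "2-4w"]
actuals = ["<2w", "2-4w", "2-4w", ">12w", "2-4w"]

print(needs_retraining(preds, actuals, baseline=0.90))  # 0.80 vs 0.90 -> True
```

In practice the flag would trigger the analysis Arian mentions, asking why performance moved, before deciding to retrain.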
Kostas Pardalis 30:06
That's interesting. I've asked many questions around how models are maintained and how they are part of the product lifecycle and all that stuff, so I would like to move to something a little bit different. You mentioned at the beginning that part of your job is to create value as part of the product for the customers of Homesnap. But you are also doing work in … So I assume you're also creating models or doing analyses for the needs of the company itself? Can you give us a little bit more color on that? What kind of stuff are you working on? What kind of teams are you working with? Like marketing, sales, product? How do you interact with internal stakeholders and produce value for them?
Arian Osman 30:46
Oh, absolutely. Yes. So, you know, like I said, we have a bunch of different types of subscriptions, and each subscription has certain features that are included in it. So we actually track certain metrics when it comes to usage: when somebody clicks here, or, you know, whether they look at this property after looking at that one, so on and so forth. So we look at all those different metrics, if you will. And as data scientists, you know, we actually collaborate with everybody in the organization. From a Homesnap standpoint, we talk with the dev team when it comes to assisting with the pipeline, we talk with the marketing team, with the sales team when it comes to CRM or business analytics, and so on and so forth. And we actually work with them, because, again, we're working with the people who know the most about each product and what features are included in it. So we ask them where that data is: how can we get that data? How can we process that data into our models in a way that is more automated, as opposed to manual, to give them the results that they need in order to, say, calculate retention? You know, how many people are likely to subscribe again, as opposed to just letting their contract end? Stuff like that is something that we look at internally. When it comes to numbers and revenue and stuff like that, there are methods that we can use to project and predict revenue. So, problems like that may need a little help, or may need some additional verification from the data science side, so that they are satisfied with the results that we're giving them and that they're producing themselves. So also, you know, we talked about probabilities as well.
And, you know, we could create a scoring mechanism, you know, how likely something is going to happen, as opposed to not, and it’s up to the business team to, well, not necessarily the business team, but it’s up to, you know, the sales or marketing or whomever we’re working for, customer experience, all that stuff, all of them to determine, you know, what is that threshold where it’s an issue. So we’re constantly working with them from start to finish when it comes to the access to the data that we need, and how we utilize it, and we put in our input, what if we look at this as well? And so it’s a very collaborative sort of experience that we always endure when it comes to dealing with internal requests.
Eric Dodds 33:42
That's encouraging to hear, Arian, because it's really not always the case. It's getting better and better, from what we hear on the show at least, especially in organizations that place a really strong emphasis on the data engineering and data science functions because they see the value in them. But in a lot of organizations, it's not that way, and collaboration between teams isn't always pleasant. One way we've heard it described, which I think is apt, is that as a data scientist or data engineer, you have internal customers, and they're not always easy customers to please. So I'm glad it sounds like things are running pretty efficiently at Homesnap.
Arian Osman 34:39
Well, we have a great team on the data science side of things, and we work together within our own little group. But me, I'm the type of person, and I'm fairly new, I started in late February, who, as the lead of the team, actually reached out to, say, the business development team: what can we do to help you? What can we do to make your job easier? And this depends on the person as well. Since I've done consulting, and since I have customers from my other businesses, I try to think of everybody that I work with as a customer, even coworkers. It's not about wanting to please people; it's about wanting to assist in any way I can to make everybody's job easier if I can do it. One of the things is that I'm the most knowledgeable when it comes to the data sciences, the theory behind it, as well as the general implementations of what would be created. But it's also important to talk with these different verticals, because as a data scientist you'll learn something too: you'll learn something about the business, something about marketing. Data science can be applied to so many different fields, and we can think creatively about how we want to help these people. So it's a collaboration, but we're learning from one another. And one of my jobs, something that will be happening soon, is education, because not too many people are familiar with data science and the metrics we look at. Internally to the team I gave a presentation on why we look at this metric as opposed to that one, and I also presented it to management.
So I told my boss I think it would be good if we presented this to the QA team, so that they know how to test these models and how to verify whether something is working correctly or not, because you can't just focus on accuracy or any other single metric; there are multiple metrics that you may have to look at. So it's also my job to educate people, not to the point where their heads would explode, but just so they're able to understand what we're looking at and why we think our models are sufficient.
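The point about not relying on accuracy alone can be made concrete with a toy example: when positives are rare, a model that misses every one of them can still post a high accuracy number. A minimal stdlib-only sketch (the data is fabricated for illustration):

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, and recall for binary labels.

    Illustrates why QA can't verify a model on accuracy alone:
    with rare positives, recall exposes a failure accuracy hides.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return {
        "accuracy": correct / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# One positive among ten examples; the "model" predicts all negatives.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0] * 10
m = classification_metrics(y_true, y_pred)
# m["accuracy"] == 0.9, yet m["recall"] == 0.0
```

A QA team looking only at the 90% accuracy would sign off on a model that never catches the event it was built to catch, which is exactly the kind of thing this internal education is meant to prevent.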
Eric Dodds 37:23
Yeah, it almost sounds like an optimization problem in and of itself, right? You're taking inputs from around the organization and trying to optimize the utility of the data science practice, which is really fascinating. Well, we're getting close to the end here, but I want to return to something you mentioned at the beginning. If any of our listeners have never googled MLS, just Google "MLS real estate," maybe click on images, so you can see the type of data and software that we're talking about. My limited exposure tells me that it's pretty messy, and that there probably are not really good resources like APIs, etc. So I'd love to hear more about that, and really, just in general, when you're dealing with real estate data, what are the unique challenges that you face, especially as you're trying to work with it at scale, building models?
Arian Osman 38:16
Yeah, so, well, let me just tell you this, from my vast experience in dealing with data, data is never perfect.
Eric Dodds 38:26
Fair point. It’s always messy.
Arian Osman 38:28
It's never perfect. To me, I'm kind of a perfectionist when it comes to handling data, and if I see something wrong, I'm like, oh, how am I going to fix this? I try to find patterns, pattern recognition and all that. It is our job as data scientists and data analysts to find where the anomalies are and to establish where the patterns are. And how we do it, of course, may be applied within the application itself, or it may be applied for us internally. But we have our own methods to fill those gaps as necessary, whether it be some other AI model that fills the gaps, or, if the gaps aren't too frequent, we exclude those records. So there's a whole bunch of different things that we look at when it comes to the features that we're inputting into our models. We look at each feature individually, we do comparisons with other features, and we try to find the patterns of where these sorts of behaviors occur. And that comes from way before my time in data science; it comes from my DBA background, my database development background, and my software development background. How you fill those gaps is generally a standard practice. I just happen to have the skill to utilize data science to fill them if need be, but there's often a simpler solution if those gaps do exist. And like I said, even census data is not perfect. IRS public data is not perfect. We try to utilize methods that give us at least a decent approximation, and if a whole data set is not sufficient, maybe there's another data set that we can look at. So there are many things we take into consideration, including whether we should just exclude certain features altogether, because maybe there are other features that are better.
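The two simple gap-filling strategies mentioned here, exclude the affected rows when gaps are rare, otherwise impute, can be sketched in stdlib Python. The 5% cutoff, the column name, and the mean-imputation choice are all illustrative assumptions, not Homesnap's actual pipeline; a model-based imputer could slot into the same place.

```python
from statistics import mean

def fill_gaps(rows, column, max_missing_ratio=0.05):
    """Handle missing values (None) in one column of a list of dicts.

    If gaps are infrequent (below the ratio), drop the affected rows;
    otherwise fill them with the mean of the observed values. The 5%
    cutoff is an invented default for illustration.
    """
    missing = [r for r in rows if r[column] is None]
    if not missing:
        return rows
    if len(missing) / len(rows) <= max_missing_ratio:
        # Rare gaps: the simpler solution is to exclude those records.
        return [r for r in rows if r[column] is not None]
    # Frequent gaps: impute with the mean of the observed values.
    observed = mean(r[column] for r in rows if r[column] is not None)
    return [{**r, column: observed if r[column] is None else r[column]}
            for r in rows]

rows = [{"sqft": 1200}, {"sqft": None}, {"sqft": 1800}]
cleaned = fill_gaps(rows, "sqft")
# 1 of 3 rows missing (33%) exceeds the cutoff, so the gap is filled
# with the mean of 1200 and 1800: cleaned[1]["sqft"] == 1500
```

Per-feature checks like this would run before the feature ever reaches a model, which matches the workflow described: look at each feature individually, then decide whether to fill, drop rows, or drop the feature entirely.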
Eric Dodds 40:33
And are there a lot of … well, let's talk about MLS data, if it's a good example; if not, we can talk about something else. Do you have to do a lot of work on the data from a transformation standpoint at some point in the pipeline for it to be usable in your systems on a day-to-day basis by your team?
Arian Osman 40:56
Simple answer is no.
Eric Dodds 40:59
That's fascinating. That's actually rare, at least from what we hear from people on the show, which is interesting.
Arian Osman 41:05
Yeah, from a data scientist's standpoint, no. Most of the work is done by the data engineering teams. The people that I work with are just so brilliant, and they work together with the other team members who deal with the MLS to identify certain anomalies or handle them on a case-by-case basis. But the data that we collect from the MLS … thankfully, I don't have to touch much of anything, so I'm lucky in that respect. And of course, when it comes to the architecture, I work with this one other guy. When I first started at Homesnap, like I said, I had been a DBA and database developer, and when I looked at the database, I was just amazed, amazed at the architecture of the database, the normalization that was used, and the partitions that were used. I was very, very impressed, and I've only been impressed one other time when it came to that sort of thing. So we have a great team at Homesnap when it comes to getting that MLS data and cleaning it as much as we can. Because again, as data scientists, we only use a subset of that data; the app uses more of it. But from a data science standpoint, I've had minimal problems.
Eric Dodds 42:31
Oh, wow. That's fascinating. Well, first, I'd love to follow up with you after the show and perhaps get someone from the data engineering team as a guest, just because you have been so impressed by that work; we'd love to learn more about it from them if they're open. But one more question before we hop off. You have such a wide exposure to tools, and I know you've built some internally, but in terms of third-party solutions, really anywhere in the stack, from the DBA-type focus all the way down to data science-specific tooling: is there a particular tool, or maybe two, that are either newish tools that make you say, wow, this is just amazing, or tried-and-true tools that are go-tos for you? We would just love to know, from a practical standpoint, for all of our listeners out there who are practitioners as well. It's always good to hear about the different arrows in the quivers of people doing this day to day.
Arian Osman 43:33
So I mentioned Amazon SageMaker before and briefly talked about inference pipelines. I think that's going to be a great tool in the future, especially as you get more involved with models. It has certain limitations right now, but I see it evolving into something spectacular when it comes to creating pipelines. So that's one thing, AWS SageMaker. There are a few other tools on AWS I can't think of off the top of my head; I think one is Rekognition, which is pretty good with respect to image processing. As to whether I'm extremely impressed by them, the thing is, Kostas, you talked about the black box; when I look at those tools, they are black boxes to me. So it's kind of nerve-racking dealing with black boxes, especially in the subject matter that I'm in. I'm always used to utilizing libraries, say in Python. I love TensorFlow, I love Keras in Python, I love deep learning models, multi-class models, and so on. Actually, one thing that I do want to explore, and I encourage others to as well: if you haven't yet, begin looking at the programming language Julia. Julia is a fairly new language that came out maybe four or five years ago; I don't know the exact timing, but I think I was introduced to it in 2016 or 2017. It also has the ability to encapsulate Python functionality, which is something I read about but haven't yet had the opportunity to test. Based on certain readings I've come across, it's a good tool to explore because of the functionality that's available if you need to make custom models. The models I develop are more customized because of the problems I have to solve; they're more difficult than others. But when it comes to new technologies, I think Julia is going to be an up-and-coming game changer.
Kostas Pardalis 45:54
Yeah, absolutely. I think Julia is getting a lot of traction lately, and people are also talking about the performance of the language. It's quite new, and there's still a toolset that needs to be built around it, but a lot of people are excited about this particular language. So that's a pretty good point.
Arian Osman 46:14
And to add to that, having been in multiple positions and done database administration, I always look at performance as well, at whether something is running slow. For example, I've programmed in MATLAB, and I've built computational physics models in MATLAB; when I translated those models into C, they ran much faster. So it's about all the functions that may be on the back end, but also what type of parallel processing there is, depending on how the language interacts with the machine and with the models themselves. That's another piece of advice: depending on what your implementation is, you have to see how it performs as well. And thank you, Kostas, for bringing up that great point. Julia, based on what I've encountered, is a game changer when it comes to performance as well.
Kostas Pardalis 47:12
Yeah, absolutely. I think there are so many tools coming out right now, and it's still early days for this space. As you said, you mentioned SageMaker, which has been around for a while, but there's still a lot of work to be done, and it has the potential to become a great tool. As time passes, we will see more and more of these tools around, and it will be very interesting to see how they mature and what kind of tools they become in the end.
Arian Osman 47:44
And it depends also on how you're going to consume it. That's a big thing. Depending on how you're going to consume the results, or what kind of implementation you're going to do, different tools will work for different problems.
Kostas Pardalis 48:00
Absolutely. Absolutely. And that's one of the things about dealing with data in general: the context, the domain, changes the way that you have to work with the data. From my, let's say, product management or product strategy perspective, I think there are many things that we still have to learn when we are dealing with data products, or data-driven products, and how we can deliver value through them to the customers. That was one of the reasons I was asking you how the feedback comes from the product, how you know when something needs to be updated, and all that. I think we're still at a very, very early stage where we're trying to figure out how to design, how to approach, how to get feedback, and how to incorporate all of that, and of course how to have the right frameworks and technology infrastructure to be as efficient as possible. So the next couple of years are going to be very, very exciting times for anything data related.
Arian Osman 49:01
Oh, yeah, I absolutely agree with you. We’re still only in the beginning.
Eric Dodds 49:05
Well, Arian, it's been a wonderful time having you on the show. I've learned a ton, and we look forward to catching up with you in the next six months or so to see how things are going.
Arian Osman 49:18
Sounds good to me, guys.
Eric Dodds 49:22
Well, that was a fascinating conversation. Although there are many different takeaways for me, I'll do two. One is the theme that we continue to see around the practical human understanding that needs to be taken into consideration in data science. We heard that talking with Stephen from Immuta and his work on all sorts of different things: the difference between building a model in the mathematical sense and building a model that's actually going to be really helpful to someone. So that was really interesting to hear. The other takeaway I thought was fascinating is that the data science and data engineering functions at Homesnap are a little more separate than in some other companies. In some other companies that we've talked to, the data engineering and data science functions overlap, in that data science is heavily involved in pipeline management, data cleanliness, and all those sorts of things. But it sounds like that function is managed almost entirely by data engineering at Homesnap, which is interesting as a window into the different ways companies structure their data flow. Those are the big things for me. Kostas, what stuck out to you?
Kostas Pardalis 50:40
Yeah, I agree with you, and I think we will see this more as we talk with more data scientists and get to experience how companies out there structure the organization. I think it also has a lot to do with the size and the maturity of the company when it comes to data science, because in the end these two roles should be separate. So the more mature a company becomes with its data and its data science function, the more separation we will see. One of the things that I would like to point out is that I really, really enjoy chatting with people who have an academic background, mainly because they are all very passionate about the stuff they're doing. That gives me a lot of joy, and it's very interesting to see people coming from the academic space taking all the skills they accumulated there and becoming professionals and entrepreneurs. As humanity, in general, we have a lot of opportunities with all these people. Outside of that, I found it very insightful to hear an interesting perspective on how we can productize data science and models. Another thing that stuck with me is that we are still in the early stages of figuring out how we work and what kind of technology succeeds around this stuff. That's super exciting for me, both from an engineering perspective and from an entrepreneurial perspective; I think there's a lot of opportunity there for new products and new businesses to be built, and new ways of creating value. And I'm looking forward to talking with him again.
Eric Dodds 52:26
Me too. Well thank you for joining us again on The Data Stack Show, and we will catch you next time.