Episode 18:

Data Science in Health Insurance with Jason Haupt of Bind

December 31, 2020

This week on The Data Stack Show, Kostas and Eric are joined by Jason Haupt, data science lead at Bind, a no-deductible health insurance company determined to give immediate answers and clear costs before point of care. Jason’s unique background, a Ph.D. in particle physics and work at the Large Hadron Collider at CERN, has informed the way he approaches data at Bind.


Highlights from this week’s episode include:

  • Jason’s background in particle physics and his path to Bind (2:53)
  • A cloud-only approach to data and utilizing AWS (9:01)
  • Focusing on activities that help its members (12:08)
  • Dealing with 12,000 columns of data from an insurance claim form (17:13)
  • Rethinking the relationship between marketing and product teams (25:28)
  • Examining the data pipeline (29:30)
  • Privacy and security concerns with medical information (35:45)
  • How experience with the LHC impacted the way he thinks about data (40:06)
  • Transition from academic work to industry (46:20)


The Data Stack Show is a weekly podcast powered by RudderStack. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.


Eric Dodds  00:06

Welcome back to The Data Stack Show. We hope your holiday season has been wonderful. We have a very interesting guest today, Jason, from a company called Bind. Jason has a very interesting background actually coming from the world of physics, and specifically academia. And, you know, hopefully we get to talk to him about some of his work there. And Bind is a fascinating company, they are doing a lot of interesting work in the healthcare space, and bringing price transparency to health insurance, which is fascinating. So I’m extremely excited to meet Jason and learn about his sort of data science practice at Bind. Kostas, he’s such an interesting guy, what do you want to ask him about?

Kostas Pardalis  00:53

I think the most important aspect of our conversation today is that we are going to talk with a data scientist, and actually a pretty hardcore one, which is great, because so far we have mainly had people from data engineering. And we have been covering things that have more to do with the typical data stack around BI and the standard analytics that a company implements as the first step toward becoming data driven. So today, we’re going to chat with someone who has a very strong background in data science. So I’m pretty sure that we will have the opportunity to discuss some more advanced, let’s say, analytics use cases. So this is super interesting for me. And another thing that’s going to be, I think, a big part of our conversation is data privacy, and how you work with sensitive data in general. I think Bind is a very good example of a company that has to work with very sensitive data. And it will be super interesting to see, from the data scientist’s perspective, what privacy means, and how, in the end, you can deliver value without compromising, let’s say, the privacy and the security of the people who are trusting you with their data. So yeah, I think it’s going to be very interesting. Hopefully, we will also learn something about physics, we’ll see. But yeah, let’s do it.

Eric Dodds  02:20

Let’s dive in. We have a really exciting guest, Jason from Bind, Bind is doing some really interesting things in the healthcare benefits space. So we’ll hear about that. But first of all, welcome to the show, Jason, thanks for joining us.

Jason Haupt  02:35

Yeah, thanks a lot, Eric. I’m very interested in having a conversation with you guys today.

Eric Dodds  02:40

We are too. Well, let’s start out. Could you just give us a brief background on yourself. And then just a high level overview of what Bind as a company is doing in the healthcare space?

Jason Haupt  02:53

Yeah, really good. So I myself got my PhD in particle physics, worked over at CERN for a long time. For me, I used to always say petabyte is a small data set because it was easy to run 20,000 jobs overnight and process data. Left that to go into the industry, ended up in healthcare, just kind of happened local to the Minneapolis region. I worked for a provider, a large local provider organization for a while. So it means hospitals and clinics for a few years, built a team until that team got acquired by Health Catalyst, a startup that just went IPO last year, out in Salt Lake City. And when that acquisition occurred, I moved to the insurance world, did that for a few years and led a team of several hundred, building with a few petabytes of data internally to the unit healthcare, building a lot of assets, a lot of their benefits services. And all of a sudden, I got a call someday from a startup in Minneapolis that they had a new way of doing things. So I listened to the pitch. And yeah, that was right. I felt I wanted to be part of the solution in what Bind is, and actually maybe change some of the fundamental structures that I felt were not right about how health insurance operated. So Eric, you want me to get in a little bit and tell you about what Bind is?

Eric Dodds  04:10

That’d be great. I mean, you know, healthcare is such an interesting space. And it seems like Bind is doing some really interesting things. So yeah, an overview of, of how y’all are trying to change things would be great.

Jason Haupt  04:21

Yeah. One way is to compare with what other people are doing, and the other is to compare with expectations, right? One of my favorite ways of describing what Bind does is taking the consumer approach. And I’ll give you a couple of examples, right? Think about the way healthcare works and the way you expect consumer interactions to work in your day-to-day life. For instance, let’s say you take your credit card, you decide to stop and get gas, you swipe it, and you drive away. You don’t sit there and pray or hope, or whatever you may do, that when it appears on your credit card bill in three to four weeks it was only $50. Right? There’s no disconnect between the price you pay and when the transaction occurs. Similarly, you’re not going to go from where you guys are located to Vegas for the weekend, come back, and hope that in a month Delta, United, whomever you fly with, only charges you $500 for the ticket rather than $2,000. Right? Those are the types of price swings that we see in healthcare. It could be a couple hundred, it could be a couple thousand. So the fundamental problem we have here is: the consumer marketplace doesn’t exist. So what does my team do? My team has data on tens of millions of Americans in one data set and almost 200 million Americans in another data set about their experiences, their claims and other experiences with health care. We look for those patterns of how people experience health care, both cost efficiency and quality. And it’s as simple as this. We rank everybody, every provider in the space. And then what do we do? Because we’re the insurance company, I put a different price tag on everybody. And what we do then is we expose that price tag to the members. And guess what, what they see is what they pay. They can look it up in their app. If they’re not app savvy, or website savvy, they can call us, right. And like, you want an MRI right there, 100 bucks; you want an MRI down there, $2,000.
And what this does in the end: it’s open access, so we don’t restrict the access, right, we have a broad nationwide network. So we’re not one of those companies that are out there, like, ooh, we’ll just find the cheapest person, that’s the only place you can go. We think that’s horrible. We just price everybody and consumers can make a decision with their wallets. Right, and we incentivize them appropriately. Let’s say for back surgery, this one might be $500, because they only charge $20,000 to the employer, or this one might be $5,000, because they charge $200,000 to the employer. The employer saves 150K; you as an individual just saved yourself 4,500 bucks. And guess what, it works. Simple as that. We found these categories where you introduce price variations. In some of our products, like Bind On Demand, there’s an activation component where you can activate additional insurance coverage on demand; in Bind Basic, there are no activations required. Simple as that, we keep finding these categories where if you just show people the prices, enough people are going to make decisions and save tens of percent, sometimes 20 or more, of the overall health insurance cost for some of our employer groups. This actually works. And we’ve been scaling, I can tell you about some of our clients, but just an idea. We are a multiple x growth company one-one is essentially what it usually looks like for us.

Kostas Pardalis  07:38

Are you going to start using the product?

Eric Dodds  07:40

Am I going to start using it?

Jason Haupt  07:42

Well, we’re not on the individual market yet. So your employer would currently have to offer it. We operate technically as a TPA in most of our business; we’re fully insured in, let’s say, the state of Florida. But we will eventually have an individual product that you can get on the marketplaces in various states. Right now, though, your employer has to have Bind as an insurance option for you to be able to select it. For some, Bind is the only option; for others, Bind is one option among a handful of others.

Kostas Pardalis  08:12

Right, right. Yeah, actually, my comment was more about your pitching, which I think you did an amazing job pitching the product and the business. Yeah, I think both Eric and I are sold on it already. Cool. Jason, do you want to get like, into a bit of more like the technical details, like how this works? You mentioned, you made some mention about like, the size of the data sets that you’re working with? And I think it’s pretty clear that okay, like, a big part of the product itself is based on the analysis that you’re doing on data. So from a technical perspective, how does this look, like what kind of technologies you are using? And then we can also discuss a little bit more about methodologies, and what kind of analysis you are doing on the data?

Jason Haupt  09:01

Yeah, and, I mean, one of the things, you know, compared to working just previously at a Fortune 10 company, very large, a lot of data assets, a national company, is you can see what slows them down. Right. So Bind has taken a cloud-only approach to how we deal with our data, which allows us to take AWS services as we need them, scale them, and use them in a way that was just so hard to do when you have these merger-and-acquisition on-prem solutions that are really slow to catch up. One thing I’ll note about the BigCos is they’re getting better with their modernization strategies. So they’re getting to a point where they’re more and more cloud based, and more and more able to scale some of their low-level functions as well as some of their medium- and high-level functions. That’s great for them. They’re on a multi-year journey to have modern, cloud-based services that can actually scale, and not, you know, oh, that’ll take two to three years just to optimize something 10%. But we get to start off with, hey, it’s cloud based, I can click a button in AWS and double or triple my database in minutes, right, depending on how the Redshift shards work, or I can go with more of an online solution, right. So most Bind apps are, you know, a Java-based back-end microservice approach, sitting in a very heavily secured AWS account. And my team is more of a Python-based data science team. We have some things that are predictive models in production. But essentially, if you think about it from a Python perspective, those AWS services have only really been coming live in the last year, year and a half, and my team’s already over three years old. So a lot of what we had was custom built.
And we’ve kind of gone through one level of modernization there as well, where we’re using SageMaker, Step Functions, Lambdas, a bunch of those AWS technologies that are allowing us to build model inference pipelines or data transformation pipelines. And a lot of our data transformation right now is done in PySpark, just to give you an example of the type of things that we’re doing.

Kostas Pardalis  11:15

Oh, that’s great. So as a data team, can you … let’s abstract a little bit like your work, and let’s talk in terms of inputs and outputs. What’s the inputs that your team gets, which my assumptions are, like, their own data that you work with? And I’d love to hear a little bit more about like, what kind of data you’re working with and the sources? And what’s the outputs, which probably might be easy to model is like, how does this work? And how do you update and how you actually turn the result of your work into  a product at the end, something that the end customer can use, which I think it’s super interesting to hear about, because my experience so far, to be honest, it’s more about people who are doing more ad hoc analysis or like in the BI space, and the kind of work that you are doing and how you can turn this into a product. I think it’s something very fascinating and still something that the industry is trying to figure out. Right. There are still tools that are built for that means.

Jason Haupt  12:08

Yeah, that’s a really good question. And this was one of my bigger concerns when I had a large team at the large insurance company. I was worried about taking the flavor of the day in terms of the tool, getting vendor lock-in, and then a different tool a year or a year and a half later would be better, because it’s just a better product offering at that time. And I ran into that a few times, right, having a larger team and running into that problem. And it was kind of annoying. So sometimes I just hired a bunch of Java developers and developed a tool that met my needs, rather than trying to find one that I was willing to accept a little bit of vendor lock-in with, not being able to trust that they’re going to move in the same direction as me. So sometimes that’s been very successful. Sometimes it’s not. So what do we actually do here at Bind? My inputs: medical claims. So we have the plan that we’re operating, with 100-plus thousand members, and that more than doubles come 1/1, when we’re talking about hundreds of thousands of members on our plan in just a little over a week’s time. We have their medical claims coming in, then we have the other touch points: their Rx claims, other sorts of interactions, right? Eligibility checks from one of the providers, like, does this person have insurance? A person walks into a doctor’s office, and they’ll send in some sort of query that says, hey, does this person have it? So I think about inputs as signals. Each of these things is a signal: a claim is a signal, someone getting a prescription is a signal, a doctor checking eligibility is a signal from an operations perspective. Plus, a good percentage of our members log in with our app, right? They sign up, they log in, they begin to search, and all of those become more signals about member behavior that I can link to outcomes.
Plus, then I can go to the market and buy some of this historical data on these trends, or partner with other organizations to get these tens of millions, in some cases almost hundreds of millions, of other historical records; those are other signals that I can use. My team takes that historical data, we look for these patterns, and then we implement a product based on the pattern. So ranking all providers based on a myriad of algorithmic things, that’s something my team does. That gets loaded into the product. And what you see as a price tag for every provider, what you see as a price tag for every service, right? That’s something we deliver. And also, in the other sense, we take that historical data to build models; we can take these patterns and predict what’s likely to happen next. And then we put this into our martech stack. If you’re not familiar, that’s the marketing technology stack that allows us to fuel our member engagement or our internal marketing strategy. So based on something, the next time you log into the app, it might say, hey, it looks like you’re about to head toward surgery, are you interested in a free second opinion service, right? That’s just giving you an example of one of our internal marketing campaigns that are fueled by analytics and services. I can tell you, you know, I’ve had researchy jobs in the past, but I heavily focus on things that are driving value to my members. In fact, if my team’s like, oh, we want to improve this algorithm, I’m gonna say, well, let’s look at the roadmap: how is this actually going to help our members? What’s the likelihood that it’s going to help our members? And we try to focus our activities on things that are going to help them.
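
The "rank every provider, then put a price tag on each" idea Jason describes can be sketched roughly as follows. This is only an illustration, not Bind's method: the `Provider` fields, the equal cost/quality weighting, and the `COPAY_TIERS` dollar amounts are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Provider:
    name: str
    avg_allowed_cost: float  # hypothetical: average amount charged to the employer
    quality_score: float     # hypothetical: 0.0 (worst) to 1.0 (best)


# Hypothetical member price tags, best-ranked tier first (illustrative amounts only).
COPAY_TIERS = [100, 500, 2000]


def rank_providers(providers, cost_weight=0.5):
    """Order providers from best to worst by a blended cost/quality score."""
    max_cost = max(p.avg_allowed_cost for p in providers)

    def score(p):
        cost_efficiency = 1.0 - p.avg_allowed_cost / max_cost  # cheaper is closer to 1
        return cost_weight * cost_efficiency + (1.0 - cost_weight) * p.quality_score

    return sorted(providers, key=score, reverse=True)


def assign_price_tags(providers):
    """Map each provider name to the copay of the tier its rank falls into."""
    ranked = rank_providers(providers)
    tags = {}
    for position, provider in enumerate(ranked):
        tier = position * len(COPAY_TIERS) // len(ranked)
        tags[provider.name] = COPAY_TIERS[tier]
    return tags


providers = [
    Provider("Clinic A", avg_allowed_cost=20_000, quality_score=0.9),
    Provider("Clinic B", avg_allowed_cost=90_000, quality_score=0.7),
    Provider("Clinic C", avg_allowed_cost=200_000, quality_score=0.6),
]
price_tags = assign_price_tags(providers)
```

The member only ever sees the resulting price tag; the ranking inputs stay behind the scenes, which matches the "what they see is what they pay" framing earlier in the conversation.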

Kostas Pardalis  15:42

That’s amazing. And I think Eric will have a couple of questions to ask about the marketing tools and how you work with them. But before we go there, my last question, data related for me, and then we can return to that is about you said something very, very interesting. You talked about signals, like all these data points that are coming there, and like actually signals that you combine together, and the end result of the price point like price value, in my mind, what I find extremely fascinating, it might be because of me, but is how you start from something that has so many dimensions, all the signals that you are talking about, because I think the problem with a bit of a problem with the term signal is that people might tend to think that it’s something very one dimensional. But usually, these data points might be quite complex. And you through all the models that you’ve built to collapse these into something that’s like a numerical value that someone can use. And for me, this whole process in this kind of magic, that data science, and all these algorithms do are like, it’s amazing. But can you share a little bit more about the structure of the signals that you’re working with, what they look like? You talked about claims, right, I think most people will think of a claim as a  document that they have to fill out, right. So how does a claim look like from the perspective of a data scientist, and what’s the complexity, and what kind of preparation do you have to make on these data in order to turn them into signals that then you can apply all these algorithms and turn them into value at the end?

Jason Haupt  17:13

Yeah, that’s a really good question. I’ll give you an example of the claim, and I’ll give you an example, as well, based on our member direct search experiences. But let’s unpack that example of the claim, right? The claim comes in what’s called an X12 EDI format, electronic data interchange format, that’s been around for quite some time. This format is very compact, and it can be unpacked. A couple of times in the past, I’ve had people write Java parsers to unpack it. When you unpack this into a data warehouse, for a professional or an institutional claim, you usually end up, and I’m not joking here, with between 10 and 11 thousand columns, right? It’s a very sparse thing. Not all of those columns are populated. But sometimes there are things that are given and some things you don’t know. So the structure of a medical claim, which you think about as this paper form, gets transformed into 12,000 columns, sparsely populated. Yes, for a single claim. It’s because there are so many loops that are allowed, right? So you can have 25 diagnostic codes for every procedure code, right. And then every procedure code can have a different assigned pointer to a description as to why, right? So if you’re in a JSON or XML mindset, you think about these nested loops, right, and why it’s so compactified. But if you want to unpack it and denormalize it as much as you can, and I’ve done this activity at the BigCo, it becomes big. What we found out is you can create levels from that: going from that very unpacked version to several hundred key variables, and then to several dozen key variables. So I’ve gone through that activity. It was kind of interesting: when I left for the startup, we had just developed a big product, a couple of teams adjacent to me and my own team, this kind of real-time online claims processing system in micro-batch.
That is, a claim comes in the door and it issues fraud predictions within minutes, right? I think it ran every 10 minutes. It was great architecture. Then I read about Uber’s Michelangelo architecture, and I was like, ah, an online/offline design. If you haven’t read it yet, they’ve had a couple of articles, in 2018 and again in 2019. I’m like, that’s very much like the architecture that we had built: taking things in the database, unpacking them, creating an online version by taking maybe those top one or two hundred features, putting them into a feature store, and then building all of your models on that feature store. So yeah, when I say this is a signal, it is not one dimensional. It could be 10K to 12K dimensional, but when I’m actually running my models, I’ve already limited it down to those couple hundred features or so that are key, especially for things that are online; offline, I can keep a few more. But to be honest, you know, it’s so sparse that going beyond a few hundred, the value isn’t there. So that’s an example. And what’s interesting, my team even did that here at Bind. We unpacked that format, we picked out the top 20 or 40 features or variables that mattered, built our models specifically on those features, and therefore deployed our models specifically on those features, to get, you know, depending on what we were trying to predict, varying degrees of success, some of which now are impacting our members positively. So if you’re interested, I can tell you the other space is search, right? We have type-ahead as you’re searching. Every time you click and add another letter, we have metadata about that search. So you can think about it as like, oh, you type in diabetes. But by the time you get to the final “s”, I have a row in my database for every letter you’ve typed. And I know what search results existed.
I know what a search attempt looks like, I know if you went back and went forward, and what your final search was. So even though some people would say, oh, it’s a signal, what do they search for? Well, I’ve got metadata stored in my logs for every keystroke you made, which allows me to make sure my search is working effectively. Right? They’re finding the things quickly. They’re not, they’re not misspelling things. It’s providing “diabetes” quickly for them. Right? Those are the types of experiences that we enable by just looking at all the data.
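
To make the claim-unpacking point above concrete, here is a minimal sketch of how nested loops turn into a wide, sparse set of columns, and how a small feature set might be picked from them. It assumes the claim has already been parsed into nested dicts and lists; the field names, the placeholder codes, and the fill-rate selection rule are hypothetical and stand in for whatever criteria a real team would use.

```python
def flatten(node, prefix=""):
    """Turn nested dicts/lists into one flat dict of column_name -> value."""
    columns = {}
    if isinstance(node, dict):
        for key, value in node.items():
            columns.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(node, list):
        # Each repeated loop entry gets its own numbered column group,
        # which is why the unpacked table becomes so wide and sparse.
        for index, item in enumerate(node, start=1):
            columns.update(flatten(item, f"{prefix}{index}."))
    else:
        columns[prefix.rstrip(".")] = node
    return columns


def fill_rates(rows):
    """Fraction of rows in which each column is populated."""
    counts = {}
    for row in rows:
        for column in row:
            counts[column] = counts.get(column, 0) + 1
    return {column: n / len(rows) for column, n in counts.items()}


def top_features(rows, k):
    """Keep the k most-populated columns (ties broken alphabetically)."""
    rates = fill_rates(rows)
    return sorted(rates, key=lambda c: (-rates[c], c))[:k]


# Hypothetical, already-parsed claims with placeholder codes.
claim_a = {
    "claim_id": "A1",
    "service_lines": [{"procedure": "PROC-1", "diagnoses": ["DX-1", "DX-2"]}],
}
claim_b = {
    "claim_id": "B2",
    "service_lines": [{"procedure": "PROC-2", "diagnoses": ["DX-1"]}],
}
rows = [flatten(claim_a), flatten(claim_b)]
selected = top_features(rows, k=3)
```

With up to 25 diagnosis codes per procedure and many service lines per claim, the same numbering scheme quickly produces thousands of mostly empty columns, which is the sparsity Jason describes.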

Kostas Pardalis  21:39

Oh, that’s amazing. I have one last technical question before I let Eric ask his questions, but I’m really getting excited with that stuff. So you talked about unpacking the format. And it’s very sparse, as you said, and you also mentioned Redshift. So can you give us a little bit more of a technical description of how this unpacking happens, like from the JSON or XML document, whatever it is? Do you end up with 10,000 columns before you start creating the features? And the reason I’m asking is because I know about the limitations of Redshift; for example, you can’t have a table with more than 1500 columns. So I’m very interested to see how you manage this dimensional explosion with the limitations of a data storage system like Redshift.

Jason Haupt  22:26

Yeah, so in the previous role, we unpacked everything; we were using HBase because of its ability to just hold the entire object, there at the BigCo. And then we would create HBase tables that were, you know, reduced feature sets, right, so that worked fine. But now, it doesn’t make sense for us to unpack the entire thing, because we already know not every field is valuable. Or we can do that in a future state. So we define a schema. Let’s say the claim is, you know, in a JSON format. Then we have a schema on top of that, and we unpack according to the schema that we define, right? So it’s only those variables that we’ve determined to unpack out of it. If you want to think about this from a technical perspective, we are definitely an orchestration organization, right? Kafka is central to the way we set things up. So we have these engines that go in, this schema gets unpacked, and then it gets put into a Kafka topic that anybody who needs it can use, right. So there’s something that listens to that topic and then instantiates that unpacking into the analytics-ready database that I just talked about, right? There are other people who subscribe to that outcome to actually begin to process that claim, right, to adjudicate it and determine what the actual price should be, how much the provider is owed, how much the member may or may not owe, etc., and how much the employer needs to pay. So we have quite a few microservices that allow these transfers and these processes to occur.
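
The schema-driven unpacking and topic fan-out described above could look something like the sketch below. To keep it self-contained, an in-memory `InMemoryBroker` class stands in for Kafka; it is not a Kafka client, and the schema, field paths, and topic name are hypothetical.

```python
import json
from collections import defaultdict

# Hypothetical schema: output column -> path into the parsed claim document.
CLAIM_SCHEMA = {
    "claim_id": ("claim_id",),
    "member_id": ("subscriber", "member_id"),
    "billed_amount": ("totals", "billed"),
}


def unpack(claim, schema):
    """Extract only the fields named in the schema; missing paths become None."""
    row = {}
    for column, path in schema.items():
        value = claim
        for key in path:
            value = value.get(key) if isinstance(value, dict) else None
            if value is None:
                break
        row[column] = value
    return row


class InMemoryBroker:
    """Stand-in for a message broker: each topic fans out to all subscribers."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        payload = json.dumps(message)  # serialize once, as a real broker would carry bytes
        for handler in self._subscribers[topic]:
            handler(json.loads(payload))  # every consumer gets its own copy


broker = InMemoryBroker()
analytics_rows = []      # plays the role of the analytics-ready database loader
adjudication_queue = []  # plays the role of the claim adjudication service
broker.subscribe("claims.unpacked", analytics_rows.append)
broker.subscribe("claims.unpacked", adjudication_queue.append)

raw_claim = {
    "claim_id": "C1",
    "subscriber": {"member_id": "M9"},
    "totals": {},            # billed amount missing in this example
    "notes": "not in schema, so never unpacked",
}
broker.publish("claims.unpacked", unpack(raw_claim, CLAIM_SCHEMA))
```

Because only schema-listed fields are extracted, the downstream table stays narrow regardless of how many loops the source claim contains, and adding a field later is a schema change rather than a re-parse of everything.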

Kostas Pardalis  24:00

That’s great. All right, Eric, he’s all yours. I know that you have many questions to ask, but you know me, I get too excited sometimes.

Eric Dodds  24:10

I mean, it really is fascinating. I just love all the unique things that we learn on the show like a medical claim producing you know 11,000 columns. It’s wild. Jason, I’m interested in the sort of customer experience aspect of what you discussed as far as the outputs of the data. And I have two questions there. One is about the interaction between the data science team and the marketing team. And the second is about this, the technical piping that sort of connects your work with the martech stack, as you said, but let’s start with the relationship between the data science team and the marketing team or other people driving customer experiences. And specifically I’m thinking about even the example that you gave around, you know, providing a customer who opens the app with a recommendation on a free second opinion. Where does an effort like that originate? Is that coming from marketing or coming from someone in product? And then depending on where it originates, how do you work with those teams to sort of produce the output that they need from your work?

Jason Haupt  25:28

Yeah, so where in the organization this is owned, let’s say that’s been something that has changed, because we’re still trying to find the optimal structure. So when this first came out, there was a product owner. Let’s think about this as a store: you have inventory, you have these SKUs, you have these things that people can purchase, right? Things that have price tags. So a provider doing this thing somewhere is a SKU. So you had inventory. We also thought about it from a retail store concept. Then you have merchandising, right? How do you arrange the things in the store such that people can see things, right? You put the things you want to highlight to people at eye level or around the end caps. So in our product division, we still have an inventory function. We had a merchandising function, and the person who owned merchandising was in charge of basically figuring out how things get stocked, right? Where they were from a visual perspective; think about it in the app, you know, how do we highlight things. We’ve since kind of changed that function, which served us very well, and now we have a member experience function within our operational business. They are more in charge of, call it the arrangement of how things are stocked within the store, right, I want to stick with that retail construct. Our marketing team plays its part, making sure that the technologies are there, that our brand makes sense. And they own a lot of aspects of it, right, how to develop the front end of that. For instance, the videos that people are going to see, the images that are embedded onto the machines of the potential people who are going to select us, right. So our marketing team is usually focused on selling Bind out of the front door, and then selling Bind to the employees within these organizations, or at least giving them information so they can make the choice for Bind.
We love to be in choice environments. In many situations, we don’t want to be the only option. We want people to choose us over their high-deductible health plan. And, you know, one thing I didn’t tell you: there’s no deductible with Bind, right. There’s no coinsurance, you don’t have to hit some number before Bind kicks in. If this is a $100 MRI, that’s all you’re gonna pay, the $100.

Eric Dodds  27:49

I read that on the site, which is awesome. It got Kostas and me excited; we’re gonna go back to our employer and ask them to sign up.

Jason Haupt  27:59

So we just want to give people that information, that for many people, go to our website, type in the things that you care about, is it diabetes, is it this drug? Is this better for you? Right, so our marketing team focuses heavily on those, the upfront experience of helping to sell or at least provide Bind to those HR managers and then to the employee level, being able to understand that during these annual enrollment events, when people are given the option to select Bind, or to not select Bind, the information, they may need to make a good decision on their own behalf. Right? So it’s a great relationship. We’ve hired some brilliant people that I really enjoy working with. So I’m really happy with the way we’re structured.

Eric Dodds  28:45

Very cool. And jumping over to the technical side. Could you explain, and I realize, you know, this may or may not, you know, be under your purview from a technical standpoint? Or maybe it is, but you talked about sort of, you know, pulling data in and then processing it? How does it go from the infrastructure that you and your team leverage through to the end user experience? Right, so let’s say they open the app, and they get a notification? What are the pipelines that actually drive that experience? And how does the data get from you? You know, sort of to the places where it’s going to be activated for the customer?

Jason Haupt  29:30

Yeah, that’s a really good question. And I would say one of the best ways for me to answer that is going back to architecture. Being all in the AWS space allows us to have some of these integrations be far more streamlined than at some of the on-prem companies, right, that may not have thought about this interactivity or these connections in their original design or use cases. So let’s say, back to the orchestration engine, I can just publish my model to Kafka with a model topic, right? And then my martech stack can listen to that, right? As long as I have some sort of data contract with the marketing team, or with the product owner of this stack, so that they know what that thing means and what the structure of the thing I published is. And sometimes when I’m early on and not ready for full production, I might publish it to a database, and they query that database, right, and that feeds into, let’s say, Segment, or whatever tools that you guys are familiar with, which is now the martech stack that understands how multimode interaction occurs, be it phone calling people, be it emailing folks, be it fax, be it in-app notifications. So basically, if you’re talking about just that transaction, that stack can just listen to Kafka and fuel its data stores, or that stack can just query a database and fuel its data stores through configuration. And then the team that manages the marketing and merchandising function can configure those campaigns, right, within those tools, based on the information that was loaded in. And if you want me to get more technical, I can. But that’s kind of the way I like to describe that.
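
One way to picture the "data contract" Jason mentions is a small, versioned message type that the publishing side validates before sending and the consuming side validates on receipt. The sketch below is an assumption-laden illustration: the `ModelScoreEvent` fields, the 0-to-1 score range, and the versioning rule are hypothetical, and the transport (a Kafka topic or a database table) is left out entirely.

```python
import json
from dataclasses import dataclass, asdict

CONTRACT_VERSION = 1  # hypothetical: bumped whenever the message shape changes


@dataclass(frozen=True)
class ModelScoreEvent:
    """Hypothetical message the data science side agrees to publish."""
    member_id: str
    model_name: str
    score: float
    contract_version: int = CONTRACT_VERSION

    def __post_init__(self):
        if not self.member_id:
            raise ValueError("member_id is required")
        if not 0.0 <= self.score <= 1.0:
            raise ValueError("score must be between 0 and 1")


def encode(event):
    """Producer side: serialize a validated event to JSON text."""
    return json.dumps(asdict(event), sort_keys=True)


def decode(payload):
    """Consumer side: reject messages from an unknown contract version."""
    data = json.loads(payload)
    if data.get("contract_version") != CONTRACT_VERSION:
        raise ValueError("unsupported contract version")
    return ModelScoreEvent(**data)


event = ModelScoreEvent(member_id="M9", model_name="second_opinion_propensity", score=0.82)
round_tripped = decode(encode(event))
```

The point of the contract is that either side can change its internals freely, as long as the published shape and its meaning stay the same or the version is bumped.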

Eric Dodds  31:14

No, I mean, we have the benefit of seeing just a lot of these different setups. And, you know, the way that you have approached it is very modern and very streamlined. One thing I’m interested in is in the development of the martech stack, I mean, it makes absolute sense that you would have a pipeline that the marketing stack can listen to, and then sort of just receive the information they need, and then, you know, route it and do the things you need to do with it. Were you involved in sort of the architecture of the marketing tech stack, as well? And was that system sort of, you know, from a claim coming in to going through your pipelines and data science to you know, sort of publishing that in a way that the marketing tech stack can listen to it? Were you involved in that? Or had they sort of, you know, architected their system separately, and you, you know, built your Kafka pipeline to suit?

Jason Haupt  32:15

Yes, so in this instance, I was aware, but not involved in the choice of technology for the martech stack. I was aware they were doing it. I understood which vendors they were. But I was not a key stakeholder in that process. So it was more of, hmm, here’s our … it was the data contract, if you want to think about it from that concept. It was: how am I going to get you data? Great. Look, we’re a Kafka organization, I can read Kafka, just put it there, right, from an orchestration standpoint. So we had this going-in position, such that we already had a method of communication, so they could go off with whatever use cases they wanted this, you know, to work for, right? To manage marketing campaigns, you really want that ability to manage the app notifications, to manage the email notifications with modern tech, right? You just do. You’re not going to build your own job application for that; it exists in the marketplace. So we just had to make sure, hey, here’s the information and how to load it in there with, you know, kind of advanced analytic techniques. So it came down to that data contract. I feel we did well with that.

Eric Dodds  33:25

Yeah, and, you know, it’s interesting, we just, you know, we actually wrote a post recently about sort of the history of data engineering, and one of the points we brought up was that, you know, IT and marketing, there’s kind of been a schism between the two groups within a lot of organizations, because IT was seen as sort of a limiter, right, like, oh, we don’t want to go to IT, because it’s going to take longer, and they’re not going to give us what we need, or, you know, they’re gonna say no. And so it’s just really exciting for me, especially coming from the marketing side, to hear about a partnership that actually, you know, seems to really be driving better value and better experiences for the customer. And I think that’s where things are gonna go in the future, you know, as companies really figure out that that creates a competitive advantage. So I’m really excited to hear about that structure at Bind.

Jason Haupt  34:15

To kind of just, you know, hit it a little bit more when I think about where, where this technology is going, we’ve still got a lot of opportunity to enable it even more. Right. That’s, that’s the clincher there. When I think about organizations that are stumbling over themselves to kind of get things in there. I don’t think that’s our biggest problem. To be honest, our biggest problem is making sure that consumers can understand our information in a way that’s valuable to them. It’s usually not a technology problem. That’s not our biggest thing. It’s understanding the user experience and optimizing that is, I think, where you become a consumer oriented organization, like I said, as long as you are up front with the technology, then we can actually focus on what really matters, creating a consumer experience that actually works for people.

Eric Dodds  35:05

Sure, the technology gets out of the way and you can focus on the user, which is the whole point. One question, speaking about IT and, sort of, you know, the issues that marketing has with IT: we can’t talk about health care data without talking about security and privacy. And, you know, insurance and health care are extremely regulated in terms of data privacy and security. So how does that impact your work as a data scientist? You’re obviously sort of dealing directly with the sensitive data, so I would just love to know the types of things that you deal with on the data science team related to security and privacy.

Jason Haupt  35:45

Yeah, and the interesting thing about Bind is, it’s, I think, the most secure PHI organization that I’ve ever been part of, just to kind of throw that out, compared to the bigger companies I worked for. So when I say that, I mean in the day-to-day operation, right? So we take a very strong dev/prod mindset, right? Most people at Bind have no access to prod. In fact, very few developers do, right? So they must develop their code on test data, dummy data, implement it, test it, put it into the pipeline. Even the data scientists need to do this, right? To the point that, when we want to deploy code into production, that’s when we get to see it. And certain variables are covered. So only a few people have access to the PHI itself, right? It makes it harder to develop, when you need to live in the dev/stage/prod, or you know, whatever paradigm, and you need to develop that way. Doing it with data science is a little weird, but we’ve figured out ways to make it better. But it takes longer to live in that paradigm. So when I say it’s more secure, I just mean it was easier to get full data access at some of the other companies. But it was really hard for them to get the data off the computers. Let me put it that way. Right. I’d say, yep, this person has access to 100 million Americans’ data. But it’s impossible for data to leave, right? They’ve got the machines so locked down. The possibility for breach is very, very, very small. But it was much easier to be like, yep, this person gets full access, because it’s part of their job. They need it. Right.

Eric Dodds  37:17

Sure. And does that impact the way that you train models on the data science side? Do you, you know, think through test data, and your development flow? How does that look on the team?

Jason Haupt  37:30

Yeah, for the most part, things that are PII or PHI, most of those identifiers are unimportant from a modeling perspective, right? I don’t need to know someone’s name. I don’t even necessarily need their address or stuff like that, right. So if I want to pull something in, because we do live in the age of checking for equity, which we do, right, I sometimes might take their zip code and link it in with socioeconomic data I might have about the region they live in, to make sure that we have an equitable product, right, and equitable outcomes in terms of how people experience Bind. So those are things that are important to me. And age is an important variable. But most of the PII can get blinded from a modeling perspective, right? Which is really nice. The only time that I need to put it in sometimes is if I’m providing output to an operations team that’s now going to go do something. So if we have a model that’s predicting people that are going to be high cost in some sort of condition category, that data needs to be plugged into a clinical ops team that might call them or might try to help that person make good decisions on the strategy, right? This isn’t all just app based; we operate a product. And we sometimes will just call the folks and make sure they have all the information they need about their benefits to make a decision. So that happens. So we might have a couple folks with kind of like that front end, but we are heavily regulated, heavily locked down. And we have a very good DevSecOps team, development, ops, and security, that really makes sure our data is protected.

Eric Dodds  39:12

Very cool. Yeah, that is very interesting on the modeling side, in terms of the data you actually need to accomplish what you need to accomplish.

Jason Haupt  39:23

Yeah, I mean, it’s the ages, but you can group age, you can do age since January one, because 66.5 and 66, trust me, are not sensitive to almost every model I’ve ever seen in the healthcare space. So just to give you an example, we find that we’re able to strip out the PHI for a modeling perspective pretty readily.
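
The blinding Jason describes, dropping direct identifiers and coarsening age, can be sketched as follows. This is a minimal sketch under assumptions: the field names, the set of columns treated as identifiers, and the five-year band width are all hypothetical choices for illustration, not Bind’s actual rules. The zip code is kept on purpose, since he mentions linking it to regional socioeconomic data.

```python
from datetime import date

# Hypothetical list of direct identifiers; the real list is an assumption.
DIRECT_IDENTIFIERS = {"name", "address", "phone", "email"}


def age_on_january_first(birth_date: date, year: int) -> int:
    """Whole years of age as of January 1 of the given year."""
    age = year - birth_date.year
    # A birthday later than January 1 has not happened yet on that day.
    if (birth_date.month, birth_date.day) > (1, 1):
        age -= 1
    return age


def age_band(age: int, width: int = 5) -> str:
    """Group an age into a fixed-width band such as '65-69'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def blind_for_modeling(record: dict, year: int) -> dict:
    """Drop direct identifiers and replace birth date with an age band."""
    blinded = {
        key: value
        for key, value in record.items()
        if key not in DIRECT_IDENTIFIERS and key != "birth_date"
    }
    age = age_on_january_first(record["birth_date"], year)
    blinded["age_band"] = age_band(age)
    return blinded
```

For example, a member born on June 15, 1954 is 66 on January 1, 2021 and lands in the `65-69` band, so a model sees neither the birth date nor the difference between 66 and 66.5.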

Kostas Pardalis  39:48

So Jason, I have a question that’s still related to the dimensionality of the data, but I really have to ask you this, because you are probably one of the few who can answer it. What is more complex in terms of dimensionality: a medical claim or a measurement in the LHC (Large Hadron Collider)?

Jason Haupt  40:06

From a dimensionality, I would say probably measurements in LHC are far more complex now. So I worked on something called the electromagnetic calorimeter when I was there, it had, if I remember, I think it was like 64,000 crystals. So just one part, one sub detector had 64,000 crystals. And the measurements of the energy were sampled every 25 nanoseconds. So you had to reconstruct the energy profile. So just that let’s say it was 10, or 15, batches of 60,000. So that’s already telling you, you’re dealing with more than a million measurements. And then for every crystal, you reconstruct the energy profile, then you run a bunch of higher levels. Was it an electron, was it a proton? Where was it going? All these higher level things. That’s just one thing, the hadronic calorimeter, there’s a tracking thing. So from a data element, for every one interaction, you’re talking about billions. It’s pretty much the most, these devices are pretty much the most highly instrumented spaces that exist. And I only gave you the surface of how instrumented that was, I think when it was designed in ’93, it was designed with more fiber optics than existed or was laid in the world at the time. Now, in the world a lot more fiber was laid during the actual development of it. But that just gives you an idea of how instrumented that was.

Kostas Pardalis  41:37

I think every time an SRE complains about all the different metrics that they have to measure every day, they should have, like, a conversation with you, so they can feel better in terms of what data points they have to keep track of every day. That’s great. Actually, I think it’s very interesting because of your background. And that’s why I asked this question, because you have a very interesting background in terms of data science, coming from doing your PhD in physics at the LHC. So, you know, we’re always talking about big data, we are talking about the scale of the data problems that people are facing every day and other stuff. You are a person who has probably been involved in one of the most complex projects, in terms of data, that humanity has come to so far. So can you share a little bit about that? How does it feel coming from the CERN experiments to go into the industry? What differences did you see there? And also, I think it’s quite important, what kind of lessons did you learn there that you’re still applying today and find very useful?

Jason Haupt  42:42

Yeah, so the most interesting thing for me about the scalability question is most times when I find somebody, be it in industry, or not, saying, oh, boy, this is just impossible to do. This doesn’t scale. I look at it and I’m like, this isn’t hard at all. I mean, sometimes they’re talking about going from one gig to 10 gigs. And it’s because they’ve chosen a tool that’s in memory, RAM. And there’s a solution for that. Yeah, none of these are mind-stopping. Similar to before. It’s like, I had a dataset that had 900 terabytes, or one that had 1.3 petabytes. I’m talking about 2010. This wasn’t that hard. We wrote C++ programs over many years that would put this on the worldwide grid. The grid might, you know, run 10,000 jobs in Torino, it might run 10,000 jobs in Chicago, and 2000, you know, and would kick it back. And it would be formatted in a way and unpacked and, you know, in eight hours, I’d have 40,000 jobs done. Oh, I made a mistake, I run 40,000 more jobs. So it’s kind of funny, almost every time someone has shown me a scale problem in industry, there’s already a solution that humans have figured out for other purposes. So it’s been really kind of funny that when I look at it, they don’t think outside of the box. So this machine’s only got 32 gigs of RAM, I need to run 32 gigs, I just can’t do it. And then I’m like, well, we could just put this on a bigger machine, which at least solves it for today. But we could use something that doesn’t require in-RAM analytics. So I haven’t yet come to a problem that I hadn’t already seen a solution for. I actually thought MapReduce was so backwards when I left, because the way we had done things at CERN is you unpack it, you do all your analytics at once, and then you repack it. And it was C++, so you could add templated classes. So when MapReduce 2 came out, I was much happier, and when Spark came out, I was a lot happier.
But I still thought they hadn’t met what, you know, the physics community had already done on these larger data sets. But they’ve definitely made it better since then. And just to kind of hit one more thing on the research side. I find people that have gone through this rigorous level of research, that have been kind of data scientists for the large research things, do very well, right. I have, starting in January, my third PhD physicist on the team, but I have plenty of other folks, you know, master’s in bio, master’s in behavioral health, who add a lot of statistical health rigor to the types of things we do. But in other areas, I’ve had people fail who have been in the research world, and they just keep going down rabbit holes and will spend two weeks on hyperparameter tuning. And knowing when the business is going to get value has been a very tough thing to learn for some folks, where, wait, when is basically perfection the enemy of good enough. I love that phrase.
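
Jason’s point about in-RAM tools can be illustrated with a small sketch of the alternative: process the data in fixed-size chunks and keep only a running summary in memory. This is an illustration only; the helper names are hypothetical, and a real pipeline would read chunks from files or a distributed store rather than from a Python iterable.

```python
from typing import Iterable, Iterator, List


def read_in_chunks(values: Iterable[float], chunk_size: int) -> Iterator[List[float]]:
    """Yield successive lists of at most chunk_size values."""
    chunk: List[float] = []
    for value in values:
        chunk.append(value)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def streaming_mean(chunks: Iterable[List[float]]) -> float:
    """Compute a mean while holding only one chunk in memory at a time."""
    total = 0.0
    count = 0
    for chunk in chunks:
        total += sum(chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError("no data to average")
    return total / count
```

Because memory use depends on `chunk_size` and not on the size of the dataset, the same code handles one gig or ten; only the run time grows.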

Kostas Pardalis  45:34

Yeah, no. And it’s amazing to hear that from a person who’s coming from academia. Because I’m also a person who worked in academia for quite a long time, I can relate to what you’re describing. And it’s amazing to hear from someone like you that you understand this distinction. And that’s great, actually. So going from academia to industry, both on a personal level and on a professional level, the change, mainly: how did you choose to do that? And what are the differences that you see there? I know that there are many people who are going after PhDs, and they might be thinking about that. So I think it would be great to hear from someone who has done it: are there things that you regret, or things that turned out to be much better than you expected? And what’s the overall experience that you can share with us?

Jason Haupt  46:20

Yeah, so I would say it’s very hard to leave academics sometimes and go into the industry. I know there’s a lot of programs out there, there’s a few now, fellowship programs, right, that try to take people with MDs and PhDs and give them data science or data engineering skills, right. And those can produce folks that now have some understanding of business value, right? Which is one good thing that can come out of those programs, right? Some people go get a master’s in business analytics from the business schools. And those as well come out with people that can understand business value without needing to be taught that from the get-go. I sort of got lucky with the role I had. I left the industry, it was super busy, got my PhD, the thought of doing a postdoc in that field, I mean, postdoc is over 10 years, the thought of moving my family every two years to various institutions around the world was not enticing. I wanted to kind of dig in, develop my family, and, you know, develop my career, be compensated okay for that. And so I just kind of fell into healthcare. An opportunity happened, I got some experience, and then I dug in, right. I showed up to that first job with a tie every day, right? I did that in that organization, which mattered. So when they needed a manager, and as people adding value, one manager moved, the director was easy for them to select me, I’d already been providing that value to the organization. So for me, it was that focus on the business value that allowed me to get my feet in the door, and allowed me to continue to move to do the things that I wanted to do. So I just kept saying, like, it’s not what did I find interesting, it’s what do I find interesting that matters, right? Because one of the first things I did there was, I built a model that predicted if people are going to come back to the hospital.
Readmission models are still very popular, they were still popular in 2011 when I built one, and then I can meet with providers, discharging people from the hospital, built model, put on dashboard, the dashboard refreshed every hour, they could see these colors based on my models, that would say, oh, this person’s got a 20% chance of coming back in 30 days, worked with them to develop interventions, and they work to mitigate that. That was cool. That added value, made me feel good, saving people’s lives, by providing information to doctors and hospitals on these screens that social workers can pay attention to.
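
The dashboard Jason describes, where each patient’s predicted chance of readmission is shown as a color, can be sketched like this. The thresholds, color names, and function names below are hypothetical and chosen only for illustration; the transcript does not say what cutoffs the real dashboard used.

```python
def risk_color(probability: float, amber_at: float = 0.10, red_at: float = 0.20) -> str:
    """Map a predicted 30-day readmission probability to a display color.

    The cutoffs are illustrative assumptions, not clinical guidance.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability >= red_at:
        return "red"
    if probability >= amber_at:
        return "amber"
    return "green"


def dashboard_rows(scores: dict) -> list:
    """Sort patients by descending risk and attach a color to each row."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(patient, p, risk_color(p)) for patient, p in ordered]
```

A refresh job could call `dashboard_rows` on the latest model scores each hour, so that the highest-risk patients appear first for the people planning interventions.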

Kostas Pardalis  48:44

It makes sense. I mean, I think from my experience, my perspective, usually people who go after an academic career do so, in many cases, because there is a passion behind that, right? Going and doing a PhD in particle physics, you need to be passionate about something to go and do that. The same applies not only in physics, but also in other disciplines. So I think what is quite important, and I think this is the responsibility of the industry to figure out how to do it, is when you try to attract these people out of academia and bring them inside the industry. I think that it’s also important, outside of, like, okay, the monetary benefits of doing that, just to try and see how these people can get passionate about the problems that they are going to be solving. And that’s what I get also from what you’ve said about this first problem that you solved in healthcare and how this drives your passion in working with data. And we have to remember, and I think that, okay, we have a pretty technical audience out there, but most people don’t understand that actually doing physics today is mainly a data science problem. I mean, we saw the first image of a black hole, and that’s mainly because people were crunching a lot, a lot of data. A big part of this work was finding the right algorithms, the right processes to take this raw signal and turn it into something that we can consume as humans and understand. And that’s exactly what the LHC is doing. And the finance sector is doing a pretty good job attracting people. But I think there’s a lot of talent out there still, and if the industry figures out the right ways to do it, there’s going to be a lot of value to be driven from there, without wanting to steal people from academia and all that stuff. Right. But yeah, that’s super, super interesting. I think we are at the end of our recording. Jason, I really, really enjoyed it. I think we can keep chatting for hours. We have many topics that we didn’t even touch.
And I’m really looking forward to having another chat with you in the future.

Jason Haupt  50:43

Yeah, I really appreciate it for you guys. For me, it was really kind of fun to kind of talk about these things. So Eric and Kostas, it was really cool.

Eric Dodds  50:53

Yeah, we really appreciate you being on the show. It’s a treat for us anytime we get to talk about Kafka and particle physics in the same conversation, which, you know, not many people probably have that privilege. So thank you for joining us. Really excited about the work that you’re doing at Bind. And we’ll reach out again, in maybe six months or so to see how things are going.

Jason Haupt  51:17

I’ll be looking forward to it. Thanks a lot, guys.

Eric Dodds  51:21

Well, that was a fascinating conversation. I think one of the most fascinating things I learned was that when you unpack a single health claim, you can get, you know, 11,000 columns. And just to hear about, I mean, we’ve heard other situations like this with guests on the show where you have something that sounds so simple, but when you actually try to do something valuable with the data that, you know, comes in a certain format or a certain size, it just creates all sorts of interesting complications. And of course, hearing about the scale of data that Jason’s worked with was fascinating to me. But Kostas, what stuck out to you, and what did you learn today?

Kostas Pardalis  52:01

That’s a great point, Eric. I think people are spoiled by, you know, interacting with digital products. And we don’t really understand the complexity behind the technology itself. And we are also, let’s say, a little bit oblivious of how powerful a processing machine the human brain is, right? Like, we consider something like a medical claim to be something that we can process so quickly. But actually working with these and representing them in a way that the machine can work with can become something extremely complicated. So it was a great discussion to have, and to communicate and help people understand the complexity of tasks that a data scientist or data analyst or data engineer has to go through in order to ensure that value at the end is extracted and delivered to all of us. So that was great. I really enjoyed discussing more about the complexity of the data. And that’s, I think, also a benefit of talking with someone who’s a data scientist, because a big part of the work of data scientists is to navigate this complexity and find ways to compress this complexity. And of course, for me, it’s always a great pleasure to chat with people that are coming from the academic environment into the industry, because these people usually are very, very passionate about the things that they do. And I think this is something that we also experienced today with Jason. As a person who is also passionate about data, it was a great pleasure to be talking with someone who shares this passion. And I’m extremely happy that I also got to learn a few more things about projects like CERN, and how humanity is actually pushing forward the state of the art when it comes to data and our understanding of the world in general. So I hope we will have an opportunity to chat with him again in the future. I think we have many more things to discuss.

Eric Dodds  53:56

Yeah, I think it was great. One other thing that was very interesting to me was how seamless the relationship seems to be between data science and marketing. And that’s pretty unique, you know, even from a technical standpoint. And so, you know, hats off to Jason and the entire team at Bind for building something pretty special there, it seems like. We’ll look forward to catching them again on another episode of The Data Stack Show. Thanks for joining us and we’ll catch you next time.