Data Council Week (Ep 2): Testing and Observability Are Two Sides of the Same Coin With Ben Castleton of Great Expectations

April 26, 2022

Welcome to a special series of The Data Stack Show from Data Council Austin. This episode, Eric and Kostas chat with Ben Castleton, co-founder and Head of Partnerships at Superconductive, the team behind Great Expectations. During the episode, Ben discusses how Great Expectations came to be, how he became a believer in open source products, the difference between working in data versus healthcare, and more.


Highlights from this week’s conversation include:

  • Ben’s background and career journey (2:13)
  • The birth of Great Expectations (5:02)
  • Defining software engineering (9:38)
  • Adopting open source products (13:04)
  • Working in data versus healthcare (18:01)
  • What’s next for Great Expectations (20:29)


The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit


Eric Dodds 0:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at

Welcome to The Data Stack Show. We're still recording on-site at Data Council in Austin. We had a great conversation with Firebolt, and the one we're about to have is with a company called Great Expectations. Now, Kostas, here's what I'm interested in about Great Expectations: one, the name; but two, of the whole variety of data quality and data observability tools, the community and adoption Great Expectations has is pretty impressive, and I think that as an open-source project in that space, they've really had a ton of adoption. So I'm interested to hear about the origin story: why did they choose to open source it, and how have they grown that community? How about you?

Kostas Pardalis 1:11
Yeah, absolutely. I mean, learning more about the community is something I definitely hope happens. They have a very vivid community, one of those cases, like the community around dbt, where people are obsessed with the technology. So yeah, I want to learn more about the technology itself, how it differentiates from the rest of the data quality tools out there, and also about the community and what it means to have an open-source dimension to a product that mainly does data quality. So I'm really looking forward to this conversation.

Eric Dodds 1:48
Alright, let’s dig in.

Kostas Pardalis 1:49
Let’s do it.

Eric Dodds 1:51
Ben, welcome to the show. I have been lurking sort of in the background, looking at Great Expectations for a long time, so it's really fun to meet you here at Data Council Austin and hear about the origin story. So thanks for giving us some time.

Ben Castleton 2:03
Yeah, no problem. Thank you.

Eric Dodds 2:05
Okay, so give us your background and tell us what led to starting Great Expectations.

Ben Castleton 2:13
My background: I basically started as an accountant and then switched over into healthcare. On the accounting side I was in hedge funds, basically working to make sure that billionaires stayed billionaires, and I didn't feel like that was doing anything good for the world. I had a good friend in Boston at the time who told me, you've got to get into healthcare, and data is where it's at. So I switched over to doing analytics in data, and that led me to meet Abe. We realized there's a lot of work to be done to help analytics in healthcare help more people and work faster. This was a consulting firm, so it wasn't SaaS from the beginning, not at all. We were sort of tools-enabled consulting. So my background led to figuring out, how can we sell consulting? How can we do data engineering for healthcare companies? That's not where Great Expectations started. But we had this meeting way back at the beginning where I remember us saying, yeah, it's okay if you spend 5% of your time on Great Expectations because maybe that'll help your career somehow. We're not sure.

Eric Dodds 3:31
Google does like whatever 20% time or something and it’s like you get 5%.

Ben Castleton 3:35
Yeah, it was like, okay, well, we'll allow that at Superconductive Health. But it became clear in 2019 that Great Expectations had legs. It was taking off, and there was a lot of demand cross-industry. So we pivoted the company to just back Great Expectations and figure out how to take it to market. We were already deeply embedded in client teams, figuring out what problems they were really trying to solve, and trying to build tooling that would enable us to actually create value at those organizations. I know dbt has a great story there, and it's the same thing: we had real problems that we were trying to solve with this little side project, and we would use it on early clients, and then it started to take off on its own.

Eric Dodds 4:24
Okay, so it's really interesting for me to hear that you were in the healthcare space doing that work, because I wouldn't think the natural decision there would be: we're going to open-source this and really build an open-source ecosystem around this tool, right? In healthcare you just kind of think about protecting IP. Tell us that story. Great Expectations has an unbelievable community around it. How did that come out of healthcare consulting?

Ben Castleton 5:01
So actually, Great Expectations was started through cross-team collaboration between Abe and James, who was working with the NSA at the time. They were collaborating across organizations to figure out how to solve some of these problems they were seeing. That was going on in parallel to us building up this healthcare consulting; for us it was that 5% time: yeah, you can go over there and do that thing you're doing. Eventually, James came over and joined our team as we moved more toward getting Great Expectations out there. But James and Abe really started this together. James is our co-founder, but he was at a different organization when he helped co-found what we've got going on. So it started cross-industry, and we've never had demand for it from specific industries. It's always been demand from everywhere. And then we tried to use it in healthcare a little bit.

Eric Dodds 5:02
Yeah. Makes total sense.

Ben Castleton 5:25
Yeah, what about the name?

Oh, you stole my question.

Oh, that's a good one. Well, first off, I love the name Great Expectations. I don't know if that was Abe and James together, but definitely Abe's got his name all over it; he loves old English literature and Charles Dickens. And the puns with pip install great_expectations are endless, but so good. Basically the idea was that we want to build a shared standard for data quality, do that out in the open, and figure out how we can validate and test whether we're getting what we expect from data at different points in the lifecycle. And then there are lots of different places you can go from there, but that's the entry point into figuring out how to collaborate better around data and enable collaboration.

Kostas Pardalis 7:04
I have a more technical question for later, but first: many things are happening right now in this space. Some companies say they're in data quality, others in data observability, and there are different vendors arguing about the terms. What's the difference? How do you see Great Expectations in relation to what is happening with this category? Where are we even in terms of defining the category? I was thinking we're still trying to figure it out.

Ben Castleton 7:32
I'm going to tell you that we figured it out. I'm mostly kidding. But yes, there's a lot of work to do in figuring out how the industry is going to play out. As we think about the problem space, as opposed to going after it by doing anomaly detection, or just observing data, we're starting from the point where you say: we want to be able to test that data, as it moves through a system, is fit for the purpose we want it to be fit for. In order to do that, you have to have people defining what we expect it to look like, and we don't think you can ever get away from people. So when you talk about human-in-the-loop AI systems, where you have people involved, that's more closely what we think it looks like, as opposed to AI coming in, solving everything, and telling you what the problems are. It's more human-in-the-loop systems that evolve with machine learning and work together to figure out how to make things faster and automate a lot of those pieces.

Kostas Pardalis 8:47
Yeah, makes total sense. In software engineering we love to borrow terminology, and we have things like unit tests and integration tests; it's a much more mature discipline, don't you think? What would you say from software engineering is closest to what Great Expectations is? Is it like building unit tests, for example, or something else? What's the closest parallel from software engineering to what Great Expectations does?

Ben Castleton 9:34
Yeah, I've seen that question quite a few times. When we've talked about it internally, we would look at testing and observability as two sides of the same coin; you can't really split them apart and say, okay, we're only doing this one. For us, you can't get away from observability as something that you need. Let's say I've got all my data in S3, we've got Spark running over here, then we're piping into the data warehouse, and after that we're using Jupyter notebooks to do some analysis. I want to be able to see everything and understand where the problems are. So yes, that's important. But the specific tests, and the places where you can validate, that's the other side of the coin, and you can't separate those out. So in our platform we feel you've got to build both of those. On the testing side, you can build individual tests, but building all the tests you want would be a very manual and labor-intensive process. So we need the machine coming in and saying, well, how can we get 80% of that automatically? That's where you get into smarter tooling. And then also building observability into this: making sure you can see it in an easy way from a central place, and making sure you're alerting the right people. So yeah, we feel both sides are really important.
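The pattern Ben describes, declaring what data should look like and then validating it, can be sketched in a few lines of plain Python. This is a toy illustration of the expectation idea only, not the real Great Expectations API; the function names and sample records below are made up for the example.

```python
# Toy illustration of the "expectation" pattern: declare what you expect
# about data, then validate records against those declarations.
# (Hypothetical names; not the Great Expectations API.)

def expect_not_null(rows, column):
    """Expectation: every row has a non-null value in `column`."""
    failures = [r for r in rows if r.get(column) is None]
    return {"success": not failures, "failed_rows": failures}

def expect_between(rows, column, low, high):
    """Expectation: every value in `column` falls within [low, high]."""
    failures = [r for r in rows if not (low <= r[column] <= high)]
    return {"success": not failures, "failed_rows": failures}

# Made-up sample batch with two deliberate problems.
rows = [
    {"patient_id": 1, "age": 42},
    {"patient_id": 2, "age": 135},   # out of plausible range
    {"patient_id": None, "age": 30}, # missing identifier
]

results = [
    expect_not_null(rows, "patient_id"),
    expect_between(rows, "age", 0, 120),
]
print(all(r["success"] for r in results))  # prints False: both expectations fail
```

Each expectation returns a structured result rather than raising immediately, which mirrors the observability half of the coin: the failing rows can be surfaced in a report or alert instead of silently stopping the pipeline.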

Kostas Pardalis 11:06
In software engineering we also have processes like CI/CD where this testing happens, right? We run the tests when the code is committed, before something is deployed. What's the process with data? Because we don't really have CI/CD for data, right? We don't have something like Git for the pipeline where we're capturing, creating, and consuming data. Where should we test it?

Ben Castleton 11:47
First off, it is cool to see some companies actually going after versioning of data; I love seeing that sort of thing happening. Obviously there's a lot of work to do there. But as far as testing goes, where should it fit in? The same way it does with software: we would say that before you release a model to production and start getting production results off it, you want to make sure it's tested. In software you would say, okay, I'm going to make a commit, and now I'm going to run my integration tests and the unit tests on it, and we run those before we deploy. It's the same pattern with data; it's just that we don't have mature infrastructure around that process in the industry yet. But you're starting to see a lot of those pieces get built out, especially in MLOps, where you've got all this tooling coming out. We see a lot of that tooling being built, and we are right in the middle of it. You have to test before you deploy, the same way you would with software.
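The test-before-deploy pattern Ben borrows from CI/CD can be sketched as a pipeline gate: validate a fresh batch and only publish it downstream if every check passes. This is a minimal sketch with hypothetical check names; a real pipeline would run this inside an orchestrator and use a validation tool such as Great Expectations rather than hand-rolled checks.

```python
# Sketch of gating a data pipeline step on validation results,
# mirroring "run tests before deploy" in CI/CD. Checks are hypothetical.

def validate_batch(batch):
    """Run all checks on a batch; return the names of the failed ones."""
    checks = [
        ("row_count_nonzero", len(batch) > 0),
        ("no_null_ids", all(r.get("id") is not None for r in batch)),
    ]
    return [name for name, ok in checks if not ok]

def promote_if_valid(batch, publish):
    """Publish the batch downstream only if validation passes."""
    failed = validate_batch(batch)
    if failed:
        # In a real pipeline this would alert the owning team and halt the run.
        raise ValueError(f"validation failed: {failed}")
    publish(batch)

# A passing batch gets published; a failing one never reaches downstream.
published = []
promote_if_valid([{"id": 1}, {"id": 2}], published.extend)
print(len(published))  # prints 2
```

The key design choice is that the gate sits between producing a batch and consuming it, the same place an integration-test stage sits between commit and deploy in software CI/CD.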

Kostas Pardalis 12:55
Alright, so let's talk a little bit about open source. Why did you go with open source?

Ben Castleton 12:59
Yeah. So again, we were talking a little bit before this. And I mentioned, I might have been a skeptic a few years ago. And now I’m like, why would you ever build a company without having an open-source product?

Eric Dodds 13:11
Which is so interesting, right? To your average person, you say, Hey, we’re gonna build something, and we’re gonna give it away for free to the entire world, and then we’re going to build a business on it. And they kind of say like, okay.

Ben Castleton 13:24
You're giving the business away; it doesn't seem to make sense. But I guess there are two things. One, and this is my personal belief, I think most people are good and they want to do good things. So this appeals to the altruistic side of me and of most of the people I work with and have worked with: they love doing something cool and giving it away. So that's the first thing; it appeals to a side of us that's very personal. We want to do something good and cool, and that feeds into how much excitement you get, right? And then the other side is, well, if I want to deploy my product and get thousands of people using it, and eventually millions, what's the fastest way to do that? It's to build something that spreads bottom-up, where the people who actually use the software can just get it for free. They can tell their friends about it, they can deploy it, they can share it. Open source is fantastic for disseminating an idea and getting it out there; if you have a paywall, it's just much slower, orders of magnitude slower.

Eric Dodds 14:44
Talk about the timeline a little bit. And I know we're coming up on time here because you have a team dinner to get to, and we'll be respectful of that, but talk about the timeline. Did you start out as open source? Because I know you said that even in the early days you didn't necessarily think open source was the best decision you'd ever made. How long did it take? Because there's an adoption period, a sort of validation period from a community standpoint. How did that play out?

Ben Castleton 15:17
At the very beginning, Abe and I had a conversation at one point where he was saying, if our company never makes money, I would still be really happy if the open-source project got far and wide and a lot of people used it. So there was an understanding that, okay, there's this other side: we're going to be happy to build a community and build the open-source project. And bringing it back now, even if we were making a lot of money, it would feel like a failure if the open-source project died or we weren't able to create something actually useful for companies. So there's a commitment to open source that sort of supersedes the commitment to the business, but the business really follows from it; there's a lot of business value in having that open-source community. So the timeline was really: okay, let's put it out there, let's see what happens. We start to get a few hundred stars, people using it, we start to see deployments. And then it was really figuring out that we're trying to build a shared language, so we need a community, because a language cannot exist without a community to grow and develop it. Starting that community, and then seeing it grow, was really what inspired us to realize, okay, this is how we can build a business around this. It was a couple of years before we really could see that. At the beginning it was sort of a side project, but after a couple of years we could tell, okay, we can build a business.

Eric Dodds 16:59
It’s easy to look back and say a couple of years, but we can think of experiences in our life where going through a several-year period of something doesn’t necessarily feel like just a couple of years when you’re in the middle of those years.

Ben Castleton 17:13
During those years, we did hire maybe one or a couple engineers and those of us on the consulting side were paying for them. We didn’t have investment, but it was super fun during those years.

Eric Dodds 17:26
Okay, well, we want to be respectful of your time, so I have one more question and then, Kostas, I'll give you the last word. You went from making sure that billionaires stay billionaires, and then the healthcare world, where there's probably a lot of bureaucracy and things move slower. What's it like now working for a really modern open-source company in the data space? What are the biggest differences you notice?

Ben Castleton 18:01
For my personality, I needed to be in a smaller organization. So I really appreciated just being able to be with a group of people who get together and decide together: what's the best thing to do here? Not, what are you supposed to do? Not, what does that report say I'm supposed to do? Not, what is this policy? But, what should we do? What's the best thing to do? It feels really fun to do that and to be around other people who just want to do that, and I think a small startup really attracts those types of people. I'm also kind of a risk junkie, so I just wanted to see if we could do it. If we fail, okay, we're out some money, we take a hit on the salary, but let's see if we can do this. And if we do, it's really exciting. So that definitely resonated with me personally. But also, if you talk to Abe, he's been pretty vocal about being really concerned with how data is used. Is it ethical? Are we doing things that are actually good in the world with data? One of the cool things about Great Expectations is that it helps you make explicit some of the assumptions and the rules and the things you're expecting about data. And that has larger implications, like, should we do this with our data, and making that explicit in documentation. So it's kind of fun to have some ethical purpose behind what you're doing as well.

Eric Dodds 19:32
Yeah. Before you get the last word, Kostas, I just want to say I really appreciate that. It sounds like there’s an ethos inside of Great Expectations where you’re doing some like really interesting technical things but it’s very clear that there’s a culture where you see the larger picture and sort of operate according to a value system within that and I just really appreciate that.

Ben Castleton 19:54
Thank you. That means a lot. At the end of the day, we're all people here and we're building some software, but it's people who build software. Thank you.

Kostas Pardalis 20:07
Yeah, that's amazing. I think having that kind of dimension in a company is one of the things that separates a good company from a great one. So here's my last question: what's next for Great Expectations? Share with us something exciting that is coming in the near future.

Ben Castleton 20:29
There are so many, I mean, we're really excited about all the opportunities there are, but a focus for us going forward is always going to be to invest in the community around Great Expectations and invest in the open source, to build that up into something that is super useful, not just for an individual to start writing some tests, but for an individual to put hundreds or thousands of tests on a data warehouse really quickly, and be able to do that just with the open-source product. So there's a lot more investment we can do to make it seamless and easy to use, and we're not just going to save that for the commercial product; we're going to do a lot of it in the open source, so that we can really feel good about enabling data engineers to do something really powerful just with the open-source product. And then obviously it is exciting to see how we can deploy that in organizations at the enterprise level, and that's going to involve collaborative workflows. That's my role; I'm personally excited to see us release a commercial product that can enable enterprises to do some good stuff with data quality.

Eric Dodds 21:45
All right, well, thank you so much. I think we're going to get you out the door on time for team dinner, and we're excited to talk with Abe tomorrow. It's going to be a two-part episode, which will be really fun. But Ben, thanks for giving us some of your time.

Ben Castleton 21:56
Thank you so much. So good to be here.

Eric Dodds 21:58
What a fun conversation. I cannot wait to talk with the technical co-founder. A couple of things: it's always amazing to hear the origin stories. There are a lot of similarities here with the dbt story; you have a consultancy, and then technology coming out of it. I have two takeaways. The first is that it takes a lot of courage, when you're running a consultancy where you can make a lot of money and do cool things, and they were in the healthcare space where that can have a really significant positive impact, to say, okay, we're going to really invest in this open-source side project. It takes a lot of courage, and I have a huge amount of respect for teams that can do that. You look back now and it's like, oh, this is so cool, there's a great community. But at the very beginning it can be a scary proposition. And the other thing is, hats off to them for pip install great_expectations, because that's one of the cleverest tech company names I've ever heard. It makes me smile every time I think about it. I want to install it just so I can type it.

Kostas Pardalis 23:20
Yeah, makes sense. What they're doing works on many different levels, for sure: on the product level and on the community level. Most importantly, what I want to keep from this conversation is the passion the founders have about building a company, and what it means to build a company beyond just the founders. That's exactly what makes it so interesting. Seeing people obsess for months over the community shows they don't see the work this company is doing just as a way to create value in a purely monetary sense; there are more things there. And as I said during the conversation, that's what differentiates really good companies from great companies, and it's a huge indicator of the commitment the founders have to make this happen. So I'm very happy that I had this conversation with the Great Expectations people, and I'm really looking forward to seeing what's next for them, because they're very creative and I'm sure we're going to be surprised. On the more technical side, I love that we see more and more best practices from software engineering coming into working with data; we discussed unit testing and how Great Expectations relates to that. So yeah, another great conversation, and I think we should have more conversations with the Great Expectations folks. There are other people on the team that I think should be on the show.

Eric Dodds 25:11
I agree. We'll do it. All right, several more great episodes coming at you, recorded on-site at Data Council Austin. We'll catch you in the next one.

We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at That’s E-R-I-C at The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at