33: ML is a Data Quality Problem with Peter Gao from Aquarium Learning

On this week’s episode of The Data Stack Show, Eric and Kostas talk with Peter Gao, co-founder and CEO of Aquarium Learning. Peter, a former engineer at Cruise Automation, and Aquarium Learning help ML teams improve their model performance by improving their data.

Highlights from this week’s episode include:

  • How getting hit by a drunk driver made researching self-driving cars personal for Peter (2:12)
  • Filtering out the hype in self-driving car news to get a clear picture of its state today (6:52)
  • The data required for a self-driving vehicle (13:56)
  • Operation Vacation and how Aquarium can help provide the tools to make models better (16:53)
  • Utilizing neural networks to index data (20:41)
  • How Aquarium fits in the ML stack (30:25)
  • Interesting use cases of Aquarium (33:59)
  • Distinguishing subclasses of machine learning (40:05)
  • Human involvement in machine learning (46:13)

The Data Stack Show is a weekly podcast powered by RudderStack. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription

Eric Dodds  00:06

The Data Stack Show is brought to you by RudderStack, the complete customer data pipeline solution. Thanks for joining the show today. 

Eric Dodds  00:15

Welcome back to The Data Stack Show, we have another really interesting guest for you on the topic of machine learning, deep learning in particular. We’ll talk about neural networks. But it’s Peter from a company called Aquarium, and he’s the founder of the company. Really, really interesting stuff. My burning question, which is becoming more and more common, is actually not specifically about machine learning. But Peter has a background in self-driving cars. And I’m just so interested to ask him about being an early employee at Cruise, the self driving car company. And so that’s what I want to ask him about. Kostas, what’s your burning question?

Kostas Pardalis  00:56

Yeah, my question is going to be more around data. To be honest, data is a big thing in machine learning. Anyway, the algorithm is always completely useless without the right data. And also the company that Peter is building is around how we can provide better data at the end to train our models. The type of data that is usually used in machine learning is a bit different from what we usually use in analytics. So on one hand you have more structured data; for machine learning, you have more unstructured data. So it’s going to be very interesting to see and ask about the aspect of quality and how you work with data and how you annotate data, how you prepare data, in the context of machine learning, which is going to be quite different from what we have learned in the past with other guests that we had on the show.

Eric Dodds  01:42

Yeah, absolutely. I think it’ll be cool to hear about someone who’s building tooling for that, as opposed to maybe a practitioner. Let’s jump in, and get to know Peter.

Eric Dodds  01:54

Peter, welcome to the show. We’re so excited to have you as a guest on The Data Stack Show.

Peter Gao  02:00

Yeah, glad to be here.

Eric Dodds  02:02

Well, first of all, just give us a brief introduction, so our listeners know who you are. And give us a high-level overview of your company, Aquarium Learning.

Peter Gao  02:12

Yeah, so I’m Peter. I’m the co-founder and CEO of Aquarium. And we basically build an ML data management system that makes it easier for teams to improve their ML models by improving the datasets that they’re trained on. Typically, the best way to improve your model performance is just to hold the model code fixed and kind of take a good look at the dataset and find places where you can fix bad labels or add more of certain important data, and that generally tends to be the most efficient way to improve your model performance. So we make it easier for people to go through that workflow of looking at their data, finding problems, fixing those problems, and then retraining a better model.

Eric Dodds  02:51

Very cool. Well, our listeners know by now that I have tons of questions about that, but I’m going to do what I have fallen into the habit of doing. Of course, you know, we exchanged some communication beforehand, and I looked at your background, and one thing that we love to do for our listeners is just sort of learn where people came from, especially as it relates to sort of their journey with working with data. And you have a lot of experience in the self-driving car space, which is really fascinating. And, you know, there are just so many interesting things about dealing with data and, I mean, even, you know, sort of society and economy. But I would love to ask you a couple of questions about that, if that’s okay.

Peter Gao  03:35

Yeah, sure. I can also talk about my kind of lead-up to that. It was a bit of a long story, but I don’t know how much time you have.

Eric Dodds  03:43

Yeah, that’d be great. Actually, that’d be great. Because I’m going to dominate the conversation with a bunch of self-driving car questions for a minute. So if you could just give us the run up to that. I think that’d be awesome.

Peter Gao  03:52

Yeah, sure. So back in high school, I was actually on the robotics team, that was a big part of my life back then, became the captain of the team in senior year and then kind of went into college. And, you know, at the time, all the careers in robotics tend to be focused around defense. And I wasn’t super excited about that. So I went and worked in web for a little bit. So I did internships at Pinterest and Khan Academy. And you know, at that time, I kind of knew that machine learning was this really interesting field that had a lot of potential. And so I was working kind of like a mix of sort of the web stack and like normal web engineering, and then also integrating machine learning into that stack. So I worked on, you know, that at Khan Academy and Pinterest. At Pinterest, it was more for spam/fraud detection. And then for Khan Academy, it was actually for predicting the ability of students based on basically a diagnostic test that we would give to them once they onboarded onto the tool to give them the right content.

Peter Gao  04:50

So I had kind of like this background of robotics, and then also like a lot of exposure to sort of the web stack and data pipelines behind it. And then you know, when I was back in school, you know, my research was in deep learning for object detection. And I actually happened upon Cruise when, you know, that was kind of a sort of, I think, two-person YC company. And then I ended up interviewing when it was like around five people, and then decided to go there around like, you know, 18. I think I was the 18th employee. But it was personal for me, you know, partially in that, number one, it was kind of a way to like come back to doing robotics and combining that with like my machine learning and deep learning interests, which started to become more and more relevant for some of the perception tasks. And then the other thing was that I got hit by a drunk driver in college. And that was a pretty negative experience. And so I had a pretty personal stake. So, yeah, and I like small companies. So it was a lot of fun.

Eric Dodds  05:49

Yeah, well, they’re not 18 people anymore.

Peter Gao  05:53

Like 2000 at this point.

Eric Dodds  05:58

And thank you for sharing such a personal story. And how cool that you later in your career, got to return to your love of robotics, and also combine other professional experience? I mean, that’s just such a neat opportunity.

Peter Gao  06:12

Yeah, definitely.

Eric Dodds  06:13

Okay, so I will indulge myself in a couple of questions, and then hand it over to Kostas. So this is less on the technical side, but self-driving cars have had an interesting hype cycle, you know, so they’ve gone through a test and the car passed, and this is the future, and then it kind of goes quiet for a while, and then something else pops up. But I’m just interested to know, from someone who worked so closely on it, what’s real, and what’s hype? And what’s your perspective on that, I just, I know, our listeners would be interested in that.

Peter Gao  06:52

So I think if you look at some of like, the really early self-driving stuff, like even those, like sort of corny videos from the 50s, about cars that will drive themselves, you know, the issue with a lot of those sorts of setups was that you had to essentially set up specialized infrastructure for self-driving cars. You would have to put rails on the road or like magnets or something like that, so that the car would know where it was and where it needed to go. And so that’s kind of why it didn’t really work out until, like, you know, I think somewhere in like the 90s, early 2000s, there started to be kind of a resurgence in interest in self-driving. And, you know, obviously, with the sort of like DARPA Grand Challenge and Urban Challenge that led into Waymo, and stuff like that, I think it was just kind of the demonstration that the sensors, the compute technology, the sort of just hardware and software stack had evolved to the point where you could actually do perception in a relatively workable way, which is, you know, being able to interpret the world through the sensors, rather than having to build specialized infrastructure for these vehicles, for these robots, to work.

Peter Gao  07:58

And so I think, you know, the sort of critical piece that started to make self-driving really get a lot closer to reality was, you know, the DARPA Urban Challenge was like, somewhere in like, 2008, 2009. Deep learning came onto the scene around 2012. And with deep learning, you suddenly had this sort of technology that just performed really well for a lot of perception tasks that were relevant for not only like sort of contrived problems like ImageNet, but also for like real-world robotics problems, like traffic light detection, and object detection, and classification, and things like that. And that kind of provided this boost of performance. It made it really workable to have systems that could reliably and safely operate in the real world without too many modifications to the infrastructure.

Peter Gao  08:51

So you know, at least now with self-driving, you can go to Phoenix and if you have the Lyft app, you know, you can call a Waymo car, and they will pick you up, and it will send you places without a driver, and it will be safer than a human. You know, that is the state of the technology right now. And I think that part is underhyped. The part that is overhyped is this idea that, you know, it’s gonna be everywhere tomorrow. And so like, when you look at these sort of systems that are very complex, that are very safety critical, you need to basically recertify them and re-adapt them to every sort of new domain that you want to deploy them into. So moving from, you know, Phoenix to San Francisco, where you have more of an urban environment, moving from San Francisco to New York, where you have more like weather conditions like snow, or like, you know, sort of unique driving behavior. That all takes like a certain amount of time. And a lot of that time actually goes into sort of re-adapting these deep learning models or a lot of like the code inside of the sort of normal robotics stack for these new conditions. But the sort of interesting thing about self-driving is that it’s kind of this big sector where so much technology was developed. So much money went into it. So many great people got together, that it showed that real-life robotics and applied deep learning was something that worked, as long as it was put into the correct sort of system specifications. And it’s something that is super valuable in a lot of industries that are not self-driving and that in a lot of cases are a lot simpler, and a lot easier, and just as economically impactful. And that’s kind of like the customer base that we actually serve over at Aquarium. So yeah.

Eric Dodds  10:30

Wow. Yeah, it’s almost like the hype happened too early, where it was like, this is gonna change the world. And it was like, okay, it’s actually going to take way longer. But then when you say something, like, you can just download a consumer mobile app in Phoenix and get picked up by a driverless car and have, you know, a way safer experience than you can have under any other circumstance. I mean, it’s like, okay, the future is here. Like, that’s insane, that anyone can go do that in Phoenix.

Peter Gao  10:56

I think Neal Stephenson says it best, you know: the future is here, but it’s just not very evenly distributed. Yeah. And I think that’s kind of the real reason why people are underwhelmed with it, you know, it’s here, but it’s not everywhere.

Eric Dodds  11:09

Yeah, I mean, that’s also, you know, AI, we’ve had several conversations around this with AI, where, like, AI branding is the craziest thing, because it’s like, some people fear it, some people, you know, deny it, you know, and it’s sort of one of those things where, like, oh, AI and self-driving cars are gonna change the world. And it’s like, well, all you can do is drive cars around Phoenix. I mean, that’s not very cool. *Laughter*

Peter Gao  11:28

Yeah, and something, you know, like, I want to, like sort of emphasize is that, like, when we were starting off at Cruise, you know, back in, like, 2015, at that point, the state of the sort of tool chain around machine learning, or specifically, deep learning was just terrible, it was non-existent. You know, I worked on a project at Berkeley called Caffe, which was the sort of first deep learning framework after AlexNet. And at that point in time, you know, the maintainers for it were these three graduate students who were simultaneously trying to complete their PhDs, and also maintain this, like open source repo that was being used by thousands and thousands of people. And of course, you know, like, one of them is going to take priority over the others. And so like, when I got to Cruise, you know, we had to build all this stuff from scratch, because there’s a lot of parts of that sort of ML workflow, and you have to build like tools to make all of that easier. And we had to build all of it from scratch, you know, back in the day.

Peter Gao  12:20

And now you look around, and there’s so much great stuff out there that covers so many different parts of the stack. And so now it is easier than ever to get something working, you know, it’s easier than ever to get like an MVP that functions at sort of like 80% accuracy, that you can present to your boss and be like, yo, we should invest more into this. But the part that doesn’t have as much tooling, and doesn’t have as much focus, is the part around making this MVP work in production on a large variety of circumstances at acceptable accuracy. And that’s where Aquarium really focuses: basically taking a lot of the learnings that the self-driving field has already sort of grappled with and, to a large extent, like already solved, and helping all these other people who are working on machine learning, working on deep learning and trying to make their models adapt to the sort of, you know, circumstances they see in the real world, and make them better and iterate on these models over time.

Eric Dodds  13:16

Yeah. Well, that’s a perfect segue for me to end my monopoly on the conversation and Kostas I know, just from chatting with you today, you have a bunch of burning questions about Aquarium. So Kostas, take it away, I will stop monopolizing the conversation.

Kostas Pardalis  13:32

Thank you Eric. Thank you so much. So Peter, I have quite a few questions about Aquarium. But before we go there, quick question about your experience with self-driving cars, and more specifically with data. So can you give us an idea of what kind of data you’re using when you’re trying to build all these different systems that enable a self-driving car?

Peter Gao  13:56

Yeah, so if you think about a self-driving car, kind of as, like this rough, you know, software block diagram, you know, on one side of this block diagram is sensor input. So this is stuff like LIDAR, this is stuff like cameras, this is stuff like accelerometer data, GPS, all that stuff. And then out the other end comes steering actions, you know, like, you know, the accelerator, you know, set it to 80%, or brake a little bit or like steer left to this extent. And then, you know, you look at the hardware stack. And of course, like, there’s a lot more stuff that’s going on over there. But when you look at the kind of data that you’re handling on the input side, a lot of it deals with sort of the sensor data that the car basically is capturing as it drives around in the world, as well as kind of the more, you know, essentially consumer-facing inputs, like I would like to get picked up here and I would like to go there, and you know, what car is going to be the one that’s taking me in the sort of command and control aspect of it. So there’s a lot of data, but you know, I’m happy to go into specific aspects of it if you’re interested.

Kostas Pardalis  14:59

Yeah, yeah. Sure. So can you share with us a little bit more of your experience? Like with working with this data? You mentioned earlier that back then when you started at Cruise, like you didn’t have the, not even the toolset was there, right? So how was a typical day of working with data, there at Cruise?

Peter Gao  15:17

So I think the biggest distinction between a lot of these robotics use cases and sort of the more traditional web use cases that people are used to is that if you look at a site, like Google or Facebook, or something like that, most of the data that is being generated is “structured data.” You know, these are sort of like tabular data, things that can be described in sort of like a SQL database or an Excel spreadsheet. You know, in a robotics application with all these sensors, the vast majority of that data that’s being generated is all unstructured data. It’s primarily imagery, it is primarily point clouds. They are things that are essentially really hard to index with traditional data stores that were developed for web. And so like, you have this big problem, where, you know, these vehicles are generating terabytes and terabytes of imagery per hour. And now you have to figure out what to do with it, you know, how do you store it? What do you like, basically, index and not index, and, you know, from the perspective of a machine learning practitioner, like what do you train your model on? And that ends up being kind of this huge problem, not only on like, you know, the piping of where to store things and how to process it, but also just in terms of like, the sort of workflow and the intelligence on top of it, like, you know, what do you do to make this stream of just, you know, massive, you know, onrush of data into something that solves your problem for you.

Kostas Pardalis  16:42

Super interesting, and how did your experience there drive you to create Aquarium today? What are the challenges that you had and that you’re trying to address with Aquarium today?

Peter Gao  16:53

So in self driving, of course, like, you know, you have these very stringent requirements on your sort of system performance, and your sort of machine learning model performance. And so a lot of the work that we did there was around making sure that we could improve our models consistently over time. And the sort of open secret that a lot of applied machine learning practitioners have come to is that most of the practical gains in your machine learning model performance come from improvements to the data it’s trained on.

Peter Gao  17:21

So what does this mean? If you’re just getting started, and you look at your labels, and you look at your data, and a lot of it is incorrect, you know, either it’s mislabeled or it’s corrupted or something like that, like, you shouldn’t expect your model to be any good. You should probably clean up the bad data and the bad labels. And, you know, train it on clean data, and of course, you’re gonna have like much better results, and it’s just such low-hanging fruit that it is kind of just the easiest thing to look at. And then, you know, the flip side is also that when you look at the failures of your model, you know, these are not necessarily things that can be tackled with sort of like PhD-level changes to the model code. A lot of the time these are sort of cases where you need to just go and collect more data of a certain difficult scenario. And so in the self-driving use case, you know, the sort of common example I like to use is that, let’s say you’ve trained a cone detector, you know, your model takes in an image, it says, okay, here’s a cone or not a cone or something like that. And if you train this sort of model, first off, it tends to do really badly on green cones. And you’re like, oh, what’s going on? Like, you know, why isn’t it at 100% accuracy, and you look into it, and you realize it’s not doing well on the green cones, because all of the cones that you trained on are orange. So this model has never seen green cones before, it doesn’t know what to do with these green cones. And the solution here is, you should go find more pictures of green cones and collect them back and label them and retrain your model on them, and it will start to handle green cones.
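
The green-cone example can be made concrete with a per-slice accuracy check. The following is a minimal sketch, not Aquarium's or Cruise's tooling: the record fields (`color`, `label`, `predicted`) and the sample data are hypothetical, and it assumes a color tag already exists for every image, which is often not the case in practice.

```python
from collections import defaultdict


def accuracy_by_slice(records, slice_key):
    """Group evaluation records by a metadata field and compute accuracy per group."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        group = record[slice_key]
        totals[group] += 1
        if record["predicted"] == record["label"]:
            correct[group] += 1
    return {group: correct[group] / totals[group] for group in totals}


# Hypothetical evaluation results for a cone detector trained mostly on orange cones.
records = [
    {"color": "orange", "label": "cone", "predicted": "cone"},
    {"color": "orange", "label": "cone", "predicted": "cone"},
    {"color": "orange", "label": "cone", "predicted": "cone"},
    {"color": "green", "label": "cone", "predicted": "not_cone"},
    {"color": "green", "label": "cone", "predicted": "not_cone"},
    {"color": "green", "label": "cone", "predicted": "cone"},
]

slice_accuracy = accuracy_by_slice(records, "color")
for color, accuracy in sorted(slice_accuracy.items()):
    print(f"{color}: {accuracy:.2f}")
```

A large gap between slices, as with the green cones here, points to a part of the dataset that needs more examples rather than a change to the model code.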

Peter Gao  18:50

And so that sort of process of understanding the failure modes, addressing them with the proper data curation, and then making sure that the retrained model is better than the previous one that you had, that takes a lot of time if you don’t have good tooling for it. And so if you have good tooling for it, not only can this iterative process be really fast and really reliable at producing, you know, improvements to your model, but it’s something where you can essentially take the ML engineer out of the equation, where they don’t need to be there every single day hand-holding this like machine learning pipeline from end to end. Instead, this is something where you can just have a sort of domain expert who understands like what is a cone or not a cone or, you know, what is good, what is bad, and have them click around in a user interface and essentially improve the dataset and improve the model.

Peter Gao  19:44

So this is known, you know, like Andrej Karpathy talks about this a lot, as Operation Vacation, but specifically, it’s a way to reliably improve the model without needing extremely skilled labor, just from looking at the data. So when we look at things like, you know, what we work with at Aquarium, there’s just a lot of people who have the same sort of problem where they’re trying to get a model to work in production. And we’re basically giving them the tools to do this sort of same iteration cycle to make their models better and work in production.

Kostas Pardalis  20:16

Yeah, makes total sense. So can you describe to us how you can put structure to this unstructured data? Because my assumption is, and please correct me if I’m wrong on that. But I assume that like, the first step is to take this unstructured data and create some structure out of it, right? Like create some metadata, or these labels that you mentioned that then you use for organizing the work around, like model training. So how is this done?

Peter Gao  20:40

So the naive way that people try to build structure around data tends to come from basically assigning metadata on top of it. So this is stuff that, you know, for example, you can put timestamps associated to when a piece of audio was captured, or you can, you know, get something about, like, you know, who was the speaker inside of this audio. You can say which device it was captured from, you know, like that sort of stuff are basically convenience splits, to be able to sort of index your data in the same way that you would with structured data, right. But this has like a lot of limitations, because that means that if you want to capture any sort of variation in the underlying data, you have to pull it out into metadata. So either you are like, you know, having humans annotate all this stuff, or, you know, you have some way of automatically capturing all of the variation in the underlying data. And that’s just not practical in the vast majority of use cases. 

Peter Gao  21:37

So really, the magic of what we do with Aquarium is that we rely on neural networks to index the data for us. So a neural network, basically, when you run it on a piece of data, you can extract out an activation from a layer in the middle of the neural network, and produce this thing known as an embedding. And this embedding is kind of like this vector that is what the neural network thought about this data point, of this audio, or imagery or whatever. And then you can actually compare these vectors to each other and find, okay, here’s actually a cluster of very similar data, or here’s an outlier in the dataset. And so this neural network is essentially extracting structure out of this, you know, very messy input data. And by relying on this, you know, sort of aspect of neural networks, we can actually tell you what is in your dataset, what is the distribution of your dataset, what is the variation in your dataset, and you can start to uncover patterns in your model’s performance. Like, here’s like a little cluster of green cones that your model consistently fails on. And then we can also do things like search within unlabeled data to find more examples of these green cones that you can therefore collect and bring back and label, instead of having to go look through a spreadsheet of a million images and click on one link at a time to find the piece of data that you’re looking for.
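
Comparing embedding vectors to find similar data and outliers can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not a description of how Aquarium implements it: the two-dimensional vectors below are made up (real embeddings come from a network layer and have hundreds of dimensions), cosine similarity is one common choice of comparison, and the function names are hypothetical.

```python
import numpy as np


def normalize(embeddings):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.clip(norms, 1e-12, None)


def most_similar(query, pool, k):
    """Indices of the k rows in pool with the highest cosine similarity to query."""
    unit_query = query / max(np.linalg.norm(query), 1e-12)
    similarities = normalize(pool) @ unit_query
    return np.argsort(-similarities)[:k]


def outlier_scores(embeddings):
    """One minus the similarity to the nearest other embedding; higher means more isolated."""
    unit = normalize(embeddings)
    similarities = unit @ unit.T
    np.fill_diagonal(similarities, -np.inf)  # ignore each point's similarity to itself
    return 1.0 - similarities.max(axis=1)


# Made-up embeddings: rows 0-1 resemble each other, rows 2-3 resemble each other,
# and row 4 is unlike everything else.
pool = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
    [-1.0, -1.0],
])

# Embedding of one known failure case (say, a green cone); search the pool for more like it.
query = np.array([0.05, 1.0])
similar = most_similar(query, pool, k=2)
scores = outlier_scores(pool)

print("most similar rows:", similar.tolist())
print("most isolated row:", int(np.argmax(scores)))
```

The same nearest-neighbor search works on unlabeled data, which is what makes it possible to pull in more examples of a failure case without tagging every image by hand.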

Kostas Pardalis  22:59

Oh, that’s fascinating, actually. And I assume that still, I mean, you go through creating these embeddings, which create, like, extract some kind of structure out of your data. Is there semantic information around that? Or this is something that still a human has to do? I assume that the neural network is not capable of like figuring out, Oh, in this image, you have green cones or cones, right? Like the concept can be just the cone doesn’t matter about the color? Or is it also possible to do that? How does it work and how does it work together with a human operator? 

Peter Gao  23:34

Yeah, so this is kind of known as like unsupervised learning in machine learning literature. And so roughly what that means in practice is that your embeddings are producing like clusters in your imagery and your data and your like unstructured sort of input. And then a human operator, instead of looking through like a million, sort of just, you know, flat examples of data points, can just focus on the clusters. And you can look at each cluster and have a sense that, okay, this is all like, you know, the same red truck, or this is all the same, like, you know, green cone. And then also you have this notion of similarity, where like, okay, here’s a section that is green cones versus red cones, or here’s a section where you have like a very boxy object. And the cool thing about this is that these activations, these embeddings, can be extracted from pretty much any neural network. And so if you are trying to train a machine learning model on your data to do a task, then you can basically extract out these embeddings as a byproduct of your training process, and they will produce these really good clusterings. And so for the human operator, really, what it does is help them kind of look at this massive pool of data. And it distills it down into these clusters or patterns that they can more easily look through and understand what is going on.
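
Grouping embeddings into clusters so a reviewer can look at a few representatives instead of every data point can be sketched with plain k-means. This is an assumption for illustration only: the transcript does not say which clustering method Aquarium uses, the embeddings are made up, and the deterministic farthest-point initialization is just a convenience to keep the example reproducible.

```python
import numpy as np


def kmeans(points, k, iterations=20):
    """Plain k-means; returns (centroids, cluster index assigned to each point)."""
    # Farthest-point initialization: start from the first point, then repeatedly
    # add the point that is farthest from all centroids chosen so far.
    chosen = [points[0]]
    for _ in range(1, k):
        nearest = np.min([np.linalg.norm(points - c, axis=1) for c in chosen], axis=0)
        chosen.append(points[np.argmax(nearest)])
    centroids = np.array(chosen, dtype=float)

    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        assignment = np.argmin(distances, axis=1)
        # Move each centroid to the mean of the points assigned to it.
        for j in range(k):
            members = points[assignment == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, assignment


def representatives(points, centroids):
    """Index of the point closest to each centroid: one example per cluster to review."""
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return np.argmin(distances, axis=0)


# Made-up embeddings forming two well-separated groups.
embeddings = np.array([
    [0.0, 0.1],
    [0.1, 0.0],
    [0.05, 0.05],
    [5.0, 5.1],
    [5.1, 5.0],
    [4.9, 5.0],
])

centroids, assignment = kmeans(embeddings, k=2)
reps = representatives(embeddings, centroids)
print("cluster of each point:", assignment.tolist())
print("one representative per cluster:", reps.tolist())
```

A reviewer would then inspect the representative of each cluster and decide what the cluster means, which is the human step the conversation goes on to describe.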

Kostas Pardalis  24:56

But isn’t there a kind of like chicken/egg problem the way that you describe it? I mean, if I have to train first the neural networks to create the embeddings, I need to have some data, right. But in order to do that, I have to annotate the data. So how does this work in real life?

Peter Gao  25:14

So when someone is getting started with the machine learning task, usually what you do is you can actually take a pre-trained model. So there’s a lot of neural networks that are trained on just general imagery, or general audio and stuff like that. And those models can be used to extract embeddings on data that they have never seen before. So if you’re trying to just collect a set of data that you’re starting from scratch, and therefore you can’t train your own model, then you can actually use these pre-trained models to generate embeddings. And to kind of organize this data upfront for you. And of course, the embeddings will not be super great, you know, sometimes they’re going to look for similarity in things that you particularly don’t care about for your task. But then once you’ve sort of bootstrapped a set of data that you can now train a model on, then you can basically go and train your own model on that data and extract your own embeddings from it, and then now you have kind of this set of embeddings that is pretty well attuned to your task. And, to some extent, also, like, you know, what these embeddings allow you to do is to, for example, uncover like, here’s like a pattern of failure cases, right? Like, you know, we can go tell you like, here’s like a section of the dataset where there’s an edge case you haven’t seen before, but there also sometimes has to be interaction with human labelers. Where, okay, like, you know, the model thinks here is like, you know, a set of data of a certain type, but you will also want to send it to a human workforce for them to basically check it, to QA it, to label it into a form that you know is clean, and that you’re comfortable retraining your model on. So I think it’s not as much sort of automation in the sense of replacing the human as much as it’s kind of an interactive feedback loop between the human and the machine in order to produce a better model.

Kostas Pardalis  26:59

Yeah, absolutely. And I think this is a kind of pattern that is very recurrent when it comes to machine learning and AI. And that’s something that we have discussed also with other people here who are coming from this space. And it’s very interesting to see that, at the end, what is happening is like these synergies that are created between technology and humans, and how technology actually augments the human and vice versa, right, because at the end, the model, I mean, it’s a black box that takes information that has been curated by a human to learn and provide results that are relevant to humans, right. So I think that’s very interesting. And I think that’s something that should be heard more often out there, because you have, like, you know, like all these people who are like, Oh, yeah, AI’s going to be, you know, like, a post-apocalyptic situation with Terminator and stuff like that. And, yeah, I think it’s very important to hear that from experts like you.

Peter Gao  27:53

And you know, like to give you an example of kind of where this, you know, happens inside of our product, one of the things that we do with Aquarium is to surface the places in your dataset where your model disagrees the most with your labels, with your data. And we surface this to a human user, and basically show them these examples and ask, like, is the problem here that the model is wrong, that it is making the mistake on these green cones or whatnot? Or is this a case where the labels are wrong, where like, for example, like a label is missing, or it’s misannotated, or the data is corrupted. And so ultimately, the human has to be the person who judges that, right, that cannot be resolved automatically. The human has to kind of give their intent of what they want this model to do. It’s kind of like training a co-worker to do a task. Basically you have hired this new person, you give them some instructions on how to do their job. And then they do their job for a little bit, you inspect their work, and you’re saying, Okay, this is stuff that you did well, this is the stuff that you did incorrectly, and you should change up for next time. And by giving them that feedback, you know, they perform better in their job. And it’s the same way with these sort of AI models, with these deep learning models, where essentially, the human is there to give feedback of what they want that model to do. And the problem with the field right now is that when you’re communicating with a human coworker, you can, you know, speak in whatever human language you’d like. But when you’re trying to, like, you know, work with a model, it’s like, you have to give it feedback by, you know, communicating in Morse code with sticks that you’re banging on a rock. And it’s really hard. And it’s really difficult. And yeah, that’s why we’re trying to build this tooling to make it easier to interact and iterate with these machine learning models to get what you want.
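
Surfacing the examples where a model disagrees most with its labels can be approximated by ranking examples by the loss the model assigns to the given label. This is a minimal sketch under assumptions, not Aquarium's actual scoring: the function name, the class ordering `[not_cone, cone]`, and the probabilities are all hypothetical.

```python
import math


def disagreement_ranking(probabilities, labels):
    """Return example indices sorted by cross-entropy of the labeled class, highest first."""
    losses = []
    for index, (probs, label) in enumerate(zip(probabilities, labels)):
        labeled_probability = max(probs[label], 1e-12)  # guard against log(0)
        losses.append((-math.log(labeled_probability), index))
    losses.sort(reverse=True)
    return [index for _, index in losses]


# Hypothetical model outputs per image: [P(not_cone), P(cone)].
probabilities = [
    [0.05, 0.95],  # confident cone, labeled cone: agreement
    [0.90, 0.10],  # confident not_cone, labeled cone: model error or bad label?
    [0.40, 0.60],  # unsure, labeled cone
    [0.02, 0.98],  # confident cone, labeled not_cone: model error or bad label?
]
labels = [1, 1, 1, 0]

review_queue = disagreement_ranking(probabilities, labels)
print("review in this order:", review_queue)
```

The ranking only says where to look first. Whether a top-ranked example is a model failure or a labeling mistake is still the reviewer's call, as Peter notes.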

Kostas Pardalis  29:43

So let’s talk a little bit more about the product itself and how it works. From what I understand, we are talking about an iterative process here, working around the data and curating the data. One part of that is, of course, the curation itself: creating the labels, surfacing the embeddings and presenting these to the users. But then whatever you do there has to be applied to retraining your models, right, and then you go back and do the same again. So how does Aquarium work, how does it operate with the process of training a model and the rest of the infrastructure that a machine learning engineer is using today, and how does it fit in the overall, let’s say, ML stack?

Peter Gao  30:25

Yeah. So I think like, if you were to look at the ML stack, let’s say that you’re tackling a problem. You’re tackling a problem, like, you know, in NLP. Let’s say you’re doing something like, you know, named entity recognition or something like, you know, when you have your problem, now, you can sort of work back towards, like, what tools you need. You know, like, maybe you need a labeling service to, you know, annotate these things in the text, you need some sort of way to train your model in a distributed way, really quickly, you need something for your experiment management, you need something for deployment, you need something for monitoring. And there’s a lot of different components in the stack. And so some sort of, you know, some approaches by other vendors are to essentially offer all the stuff in the pipeline in one complete package. And what we’ve seen from working with a lot of machine learning teams is that, you know, if you were to consider the analogy to web, that would be like you go to like Squarespace, and you, you know, click around in Squarespace. And now you can set up this very, like, you know, nice, but very basic website. But as soon as you want to do something more complex, you know, if you want to build Facebook, you kind of instead go towards this model, where you string together a lot of different tools into a tool chain that works out for you. So you combine Sentry with, you know, like a Django server on top of AWS, you do all sorts of stuff, where the engineer essentially is like, putting together these pipes to create a cohesive product. And with Aquarium, you know, we see that is kind of like the mode that a lot of serious machine learning teams are moving towards, where they kind of stitch together a lot of different tools that are best in class for like, you know, training or for labeling or for deployment or whatever. Aquarium basically sits on top of that. And it’s kind of like a workflow layer. 
So what we do is integrate into whatever sort of data stores or labeling provider or like model type of training system that you have, and kind of give you this high level overview of, okay, this is what your data set looks like, here’s what your model performance looks like, here are the places where they disagree. And here’s an engine in which you can basically understand where the failures are happening, triage them, and then take resolutions on them by for example, identifying, okay, here’s like a section that you’re not doing very well on, and then helping you collect more of that data within the app, and then sending it off to a labeling provider to be annotated. And then basically, you know, triggering a training run after that. And then allowing you to compare the difference between your new and your old model within Aquarium. And so we kind of are this interesting mix between like JIRA and Sentry for the machine learning stack, where we kind of sit on top of whatever infrastructure people already have built internally. And we are more just telling them, Hey, this is the thing that you need to do next to make your model better, helping them do that by dispatching tasks to different tools and parts of the pipeline that they’ve already built. And helping them basically take actions to improve their model, not necessarily in Aquarium, but through Aquarium.

Kostas Pardalis  33:30

It’s interesting. And can you give us an interesting story around one of your customers or like something with the data that happened there? The reason I’m asking this question is because anything that has to do with ML and the actual work itself, behind ML, it’s something that is a little bit of a black box to the people out there. Everyone sees or thinks about the magic that happens, like we have a self-driving car, right? But can you help us a little bit understand the work that is done behind that? 

Peter Gao  33:59

Yeah, so I can actually give you a few because I think, the favorite part of my job, and the reason I left Cruise to start Aquarium, is that there are so many really interesting, awesome problem domains that people are applying deep learning to. And they’re really fun and useful and just unexpected, honestly. Like some of our customers are doing deep learning on trash. And it turns out that that’s a very lucrative industry to be analyzing what people are recycling or what food people are throwing away and getting insights to, for example, the recycling center to know how to sort different, you know, pieces of recycling or to the kitchen owner to decide what food they need to make less of, and that is like something I never would have thought of. And like yeah, there’s other people who are working on agriculture. There’s other people who are working on logistics and drones and like, so many, like just disparate places, like you know, surveillance and like, you know, industrial inspection and stuff like that, and it’s so fun, just like getting to know all these people who are doing such interesting stuff and helping them really.

Peter Gao  35:05

And so I can tell you about one of our customers that we wrote a case study on, it’s called Sterblue and they’re a company based out of Europe. And what they do is basically, they have a stack that allows you to input like drone or aerial imagery for inspection of critical infrastructure. So this is like, you know, wind turbines, power lines, cooling towers and power plants. And the way that people used to inspect this stuff was literally you get a ladder, or you get some climbing hooks, and you climb this pole, up this power line, and you go, and you look to make sure that there’s no like corrosion, or like, you know, damage or whatever to the power line. And of course, this is something that is like, very time consuming, very expensive, you’re going to miss a lot of stuff. And it’s dangerous, because you’re sending like a person to climb up this power line. And so what Sterblue does is they take the imagery, and they analyze it, you know, number one, using aerial imagery, instead of requiring someone to climb up physically. And then number two, being able to inspect the imagery with a combination of human experts and deep learning models to find defects and surface them to the sort of owner of like the grid or something like that, so that they can direct maintenance towards it. And of course, like, you know, the advantage of this model is that you can go and just inspect way more stuff, way more efficiently, and catch a lot more problems before they happen. And so for them, you know, they’ve trained this deep learning model that is kind of working in concert with a team of experts, and they wanted to make this model better, in order to be able to handle just more miles of power lines more efficiently, without needing to rely on like, this very limited pool of human experts who are going to take quite a while to get through all of that data. And so we helped them look at their model, and they asked like, okay, like, you know, where is our model doing badly. 
And they realized that most of the problems were actually just with the data. You know, there are cases where there’s like one or two labelers that they were working with, who were kind of consistently making mistakes, and they were able to go find that and catch that and give sort of like corrective feedback to the labelers so they could produce good data. And in certain cases, you know, like, sort of in their legacy sort of way that they were doing data labeling, they were using a different standard, they were drawing these very large polygons on top of certain defects, instead of very tight polygons around the actual area where the defect was occurring, you know, instead of drawing like a polygon around, like, you know, a hole in the wood, they were drawing it on the entire, you know, like power pole. And so we helped them uncover like, this is the issue. And this is why your model’s kind of outputting weird stuff. And they’re like, oh, wow, yeah, okay, that makes sense. And they were able to go back, and they were able to do a pass through their labels and fix them to, you know, adhere to this common standard of like, you know, small polygons around the actual defect. And when they re-trained the model, it got like 13% better, and that was like a week of work. And it was just such low-hanging fruit that they didn’t really know was even there until they looked. And then, you know, based on that, they were able to cover hundreds of miles of power lines a lot more quickly, they were able to cut the sort of requirements of their human experts in half and cut the labeling costs in half. And, you know, of course, they made their customers so much happier. You know, I can tell you another story about a customer that we worked with in industrial inspection, I can tell you some stories from like, different sort of customers and different sort of domains, like one of them in agriculture. But you know, it’s something where, number one, I think it’s just so great. 
There’s all these different exciting applications. But number two, I’m also surprised that the same playbook works extremely well across all these different applications. It’s something that you wouldn’t think it was something that was going to be a common way to improve all of them. But the magic of deep learning, you can apply this repeatable playbook to a fairly common set of models and achieve the same great results.

Kostas Pardalis  39:08

Yeah, yeah. I think it’s amazing how many different use cases are out there where deep learning is used, and people just have no idea about it. We all focus, you know, on what you hear about like self-driving cars, mainly to be honest, and anything that has to do with surveillance. But it’s, it’s amazing. And I think it’s something important for people to hear all these different, amazing use cases that are out there that they don’t just, as you said, they don’t just reduce costs. In some cases, they also save lives, right? Because climbing on these poles and trying to figure out if there’s a defect there. It’s a dangerous job. It’s not something that is easy to do. So Peter, two last questions for me, and then I’ll let Eric continue with his questions. But first of all, it’s about the data again, can you give us a sense of what is the most commonly used data in machine learning today?

Peter Gao  40:05

So I think it’s critical to sort of distinguish that there’s a lot of, you know, subclasses of machine learning. So, you know, if we were to go back to kind of like the late 90s and early 2000s, machine learning had been very successfully deployed in a lot of web applications for things like recommendations, or forecasting, or predictions and things like that. So this is like, you know, if you’re on Google, and you’re clicking around, you know, what do you recommend to the top of the list, or if you’re trying to forecast what is like your future revenue based on your previous revenue, or something like that, you know, these are problems that are relatively well understood, and have been applied in a lot of use cases successfully in the early 2000s. And a lot of this is because it’s something where you’re kind of getting the data for free from user actions on your site, or from just like, you know, seeing, you know, like the present versus the past, and then trying to predict the future. And all this data tended to be kind of like tabular data, like recommendations, and ads targeting and price forecasting, that’s all like, you know, stuff that you can put into a SQL database or a spreadsheet. So this is like a class of data that is still I think, extremely prevalent and extremely, you know, value adding, and it’s very common. Now, the sort of data that we deal with, in our line of work with deep learning tends to be more like unstructured data. So this is, you know, a lot of people who are dealing with imagery, a lot of people who are dealing with audio, and NLP sort of text use cases. And then some people dealing with, for example, 3D point clouds that are generated from LIDARs or like CAD models and things like that. But in this sort of new wave of deep learning as a subset of machine learning, there’s kind of more of an emphasis on unstructured data. 
And that unstructured data tends to have a lot more interaction with the real world, with the messiness of the real world as well, instead of just like the tabular sort of clean, isolated nature of like, actions within a web app or something. And I think, you know, beyond that, the thing about this paradigm of working with unstructured data is that the data does not come for free. So instead of this being something where it’s kind of like a prediction or forecasting problem, where you kind of are just like trying to like refine your guesses of what people will do in the future based on what they did in the past, now, you’re kind of doing something that is more like automation, in terms of your workflow, where you’re trying to get humans to do a certain task for you. And sometimes you have to pay them to do labeling or bounding boxes, or whatnot. And then you’re essentially telling your model to try and not only imitate it, but also to generalize from that set of data to data it’s never seen before. And so this sort of new model of doing deep learning, and the requirements around it leads to a pretty different workflow. So I think some parts are common, like a lot of the stuff that you need to use for crunching data and moving it from place to place is definitely in common. But then there’s a lot of differences in particular for like, you know, the fact that you’re using a deep learning model that has to be trained on GPUs at scale, or the fact that now you have to annotate data, or the fact that now you have to kind of think about, like, you know, what is the right data to annotate? Which, you know, we think a lot about,

Kostas Pardalis  43:25

Do you see any use case, like today, or like, do you expect to see something in the future where deep learning can be used with more structured data? 

Peter Gao  43:34

Yeah, actually, what we’re seeing right now is that some groups are actually using deep learning on structured data in really interesting ways. And there’s a lot of sort of, like graph convolutional models. I think if you were to look at some of the more advanced groups, you know, I think internal to Google and Facebook, they’re already moving over to deep learning models. I think if you were to look at, for example, Instacart. Instacart has surprisingly done a lot of stuff with deep learning for basically predicting what people should pick up inside of grocery stores, and in what order and stuff like that, which is really fascinating. And I think the reason why it hasn’t been as widespread so far is just because people kind of have been using sort of old school non-deep learning models for quite a while. And there’s a lot of inertia that carries over from that, especially since they perform pretty well. But then it gets to the point where, when you start tackling really complex problems, or when you have really, really, really big data sets, that’s the point where the new age of deep learning models offers way better performance. 

Kostas Pardalis  44:38

Yeah, that’s super, super interesting. Eric, the stage is yours.

Eric Dodds  44:43

I have learned an incredible amount. And I also have to say, I’m sure our listeners, at least some of them felt the same way, but when you gave the trash example, I had to take a minute and think okay, what if someone’s running deep learning on my trash? What is it saying about me? I got this moment of like that is so crazy. Superduper interesting. Two questions, because I know we’re getting close to time here. One, and I’m just interested in your perspective, because you’ve kind of seen the data tooling and data workflows come of age in a way. One thing that’s really interesting to me, as I think about what we’ve learned on the show over so many episodes, is that even when we’re talking about really advanced technologies, there still seems to be, I guess, if you break it down, I wouldn’t say unexpected, but especially to me, who doesn’t have a background in technology, a surprising amount of manual work that still goes into some of this stuff. And I think about Aquarium specifically, where there’s all sorts of value when I think about just the workflows, pre-Aquarium and post-Aquarium, as you’ve described them. And it’s amazing how much just effort and work that it saves in automation, and I guess, you know, living in an age where we have self-driving cars, it’s surprising to me, and I’d just love your perspective on that, because it seems to be still so pervasive, even though we’re using really, really advanced tools.

Peter Gao  46:13

Yeah, you know, like, one of the examples that I like to raise up when we talk about this is that before Aquarium, a lot of the times the sort of standard of tooling for people is like spreadsheets, or Jupyter notebooks, or like, you know, I remember working on a project where our visualizer, and our dataset organization system was Mac Preview, and it was a bunch of folders with images in them on my local hard drive that we were labeling. You know, and this is just like the reality that it’s just still really hard to work with data. And the sort of paradigm that, you know, machine learning is about, you know, like, if you look at the way that we write code, there’s so many great tools out there for debugging your code, profiling your code, understanding what is going on in your code. And, you know, with data, it’s still something where like, people are kind of waking up to the fact that you need tools to make that process of understanding and improvement much easier. And so I think, there’s always going to be some amount of labor, or at least in the next, like, you know, four years or so because someone has got to say, to the machine learning model, this is what I want, right? It’s not something where you can necessarily write a spec as a product manager of exactly what type of attributes are required to classify something as a cat. Like, it’s hard to write that down. And so the sort of process of working on machine learning tends to be a lot more iterative, where you kind of give it examples of, you know, cats, and then you see where it fails, and then you kind of correct it, and then, you know, continue going on that front. So I think there’s going to always be some amount of like human involvement on that front. But then so much of the sort of unnecessary toil that happens right now in machine learning is about trying to make sense of like, I have millions and millions of data points. 
And where is the section of this data set that I need to focus my human attention on, the one that is most important for improving this model?

Eric Dodds  48:18

Totally, I think that’s such an elegant way to put it where you … another subject that we’ve talked about on the show a bunch, but the human involvement in, in machine learning is so critical, but unnecessary toil is, I think, a much better term to describe what I was talking about, you know, that just all seems so pervasive, and how cool that you’re building tools to help solve that. One last question here. I won’t make that promise. Because I’m horrible at keeping it. But maybe one last question is, when it comes to data, for machine learning applications, you sort of have this interesting issue of critical mass, right? So it becomes valuable when you have enough data to train models. And, you know, you sort of have enough inputs to make it really valuable when producing the outputs. And you work across such a wide variety of different industries with your customers. I’m just interested to know your thoughts on the threshold there. Because I know that there are a lot of companies out there who maybe run like a really tight ship on sort of their general data practice and are wanting to explore machine learning. What’s the critical mass? And does that vary in terms of types of data? I’d just be interested in your perspective on I guess, what’s the, the low watermark as far as a threshold and sort of quantity and types of data?

Peter Gao  49:44

Yeah. So the reality is, it actually just depends on your application. It’s really hard to kind of like, say, you know, a one size fits all type scenario. Like, you know, what is the minimum threshold for it to perform well. It actually depends on how complicated your problem is, how variable your input data is, and whether you have access to like some sort of pre-trained model or not, that can kind of cut out a lot of the work of learning from the process.

Peter Gao  50:14

So I think, you know, my personal like rule of thumb for working with imagery with like a pre-trained model is that you want to get something on the order of like 10,000 examples to kind of like start off with, and you can usually just like human annotate up to like 10,000 without too much cost to yourself. But then, you know, the sort of more general way to understand this is that you can actually do something known as an ablation study, where if you have some set of data, let’s say that you have like 1,000 examples, then what you can do is you can set aside like 100 examples as an evaluation set. And then you can train on 100 examples of the remainder, or 200, or 300, or 400, or 500, or 600, 7…8…9… And then you can see like, of these models that you’ve trained on different sizes of the data set, like how well they do on that evaluation set of the 100. And if you see that you can add more data and, you know, the model performance is getting way better as you add additional bits of data, then you should probably go get some more data. But if you’re starting to get to the point where you have diminishing returns from just generically adding data, you know, that’s the point where you have to be really intelligent about what data you add to get the most improvement to your model. Because then at that point, most of your error cases are actually kind of long tail, they’re edge cases. So one last thing I actually want to leave y’all with, I know that was the last question, and we’re like, now at two o’clock, but I think the thing that we want to do with Aquarium in terms of long term vision is that we want to be able to make a system that a person who’s not an ML expert, who is someone who’s just an expert in their domain of agriculture, or, you know, waste recycling, or whatnot, can go into this nice UI, and click some buttons and get a model to do what they want and to improve it over time. That’s really our end goal with Aquarium. 
And that’s the thing that we’re building towards every day.
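
[Editor’s illustration] The data-size study Peter describes (hold out 100 examples, then train on growing slices of the rest and compare) is often plotted as a learning curve. The sketch below is a toy version under loud assumptions: the data is synthetic, the "model" is a deliberately simple nearest-centroid classifier standing in for a real deep learning model, and every function name here is hypothetical. Only the overall procedure mirrors what is said in the conversation.

```python
import random

def make_dataset(n, seed=0):
    """Synthetic 1-D binary classification data (an assumption for this sketch)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        # Two overlapping Gaussian clusters centered at -1 and +1.
        x = rng.gauss(1.0 if label else -1.0, 1.0)
        data.append((x, label))
    return data

def train_centroids(examples):
    """'Train' a stand-in model: the mean feature value of each class."""
    sums = {0: 0.0, 1: 0.0}
    counts = {0: 0, 1: 0}
    for x, label in examples:
        sums[label] += x
        counts[label] += 1
    # Fall back to 0.0 if a class happens to be absent from the slice.
    return {c: (sums[c] / counts[c] if counts[c] else 0.0) for c in (0, 1)}

def accuracy(centroids, examples):
    """Fraction of examples whose nearest centroid matches the label."""
    correct = 0
    for x, label in examples:
        pred = min(centroids, key=lambda c: abs(x - centroids[c]))
        correct += (pred == label)
    return correct / len(examples)

def data_size_study(data, eval_size=100, step=100):
    """Hold out eval_size examples, then train on growing slices of the rest."""
    eval_set, pool = data[:eval_size], data[eval_size:]
    results = []
    for n in range(step, len(pool) + 1, step):
        model = train_centroids(pool[:n])
        results.append((n, accuracy(model, eval_set)))
    return results

results = data_size_study(make_dataset(1000))
for n, acc in results:
    print(f"train size {n:4d} -> eval accuracy {acc:.3f}")
```

If accuracy keeps climbing as the training slice grows, more data of the same kind is likely to help. If it flattens out, that is the diminishing-returns regime Peter mentions, where choosing which data to add matters more than adding data generically.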

Eric Dodds  52:19

Very cool. Well, Peter, it’s been an incredible conversation, as we say to almost all of our guests, I think, maybe all of them, we’d love to check back in as you continue to build out Aquarium and see how you’re doing maybe in another six months or so, and have you back on the show.

Peter Gao  52:37

Yeah, sounds great. Well, it was great talking, and let’s keep in touch. 

Eric Dodds  52:41

As always, amazing conversation. It’s so interesting to meet people and hear their backgrounds. I’m going to … there was lots of technical stuff. So I’m going to say that I think one of the most interesting parts of the conversation that really stuck out to me was Peter’s obviously incredibly intelligent and articulate, but there’s an underlying passion there based on his life experiences, you know, sort of from early childhood interest in robotics, to going through sort of a traumatic experience, you know, related to vehicles. And I am just always amazed to see people building things that are reacting to or built upon really deep life experiences they’ve had. So I think it was a privilege that he was vulnerable enough to share some of those things with us and really appreciated it.

Kostas Pardalis  53:31

Yeah, absolutely. It was a fascinating discussion, actually. And Eric, one of the things that happened to me during this conversation is that I realized that when we are talking about self-driving cars, actually, we’re building robots. For some reason I didn’t think about this before our conversation today. I was thinking of it more like a software kind of problem. But actually what we are doing is building a robot, which is amazing. Anyway, that’s my realization.

Eric Dodds  54:02

That’s one of those things where you say it out loud and it sounds like the simplest conclusion to come to. But until he actually said that I hadn’t made the connection either. Which is so funny.

Kostas Pardalis  54:12

Yeah. Because usually when someone says the word robot, what’s the first thing that you think about? It’s Boston Dynamics, right? Like, these robots that dance, that try to walk like a human, and all these things. But at the end, that’s exactly what we are doing with a car when we want it to be self-driving. We are building a robot. Anyway, that was a very fascinating conversation. It was very interesting to hear about the techniques that they are using for extracting structure out of unstructured data, using the same neural networks that are also used as the models, with all these theories around embeddings that Peter was mentioning. And two things that I’d like our audience to pay attention to. One is, again, also from Peter, we hear that the relationship between technology, AI, ML, and humans is much more of a synergetic relationship than the antagonistic relationship that the media are trying to portray out there, which is great to hear from him. That’s one thing. The other thing is that at the end, it’s all about the data, right? If you don’t have the right data, if the quality of the data is low, no matter how good your algorithm for your model is, you’re not going to have the results that you need. So I’m really looking forward to having him again in a future show.

Eric Dodds  55:30

Yeah, absolutely. And for all of our listeners, who were also thinking about Autobots and Decepticons in the context of Transformers when Kostas said, What do you think about when you think about robots? I did the same thing. I didn’t have quite as mature of an initial thought as Kostas. So you’re in good company if you thought about Transformers.

Kostas Pardalis  55:52

Yeah, and Terminator. Don’t forget.

Eric Dodds  55:55

2030, we’re getting closer. Great. Thanks again for joining us on the show. Please subscribe on your favorite podcast network. That way you’ll get notified of new episodes every week. We have an incredible lineup in the next couple of weeks. So you’ll want to make sure to catch every show, and until next time, thanks for joining us. The Data Stack Show is brought to you by RudderStack, the complete customer data pipeline solution. Learn more at RudderStack.com.