This week on The Data Stack Show, Eric and Kostas chat with Alex Watson, Co-Founder and Chief Product Officer at Gretel.ai. During the episode, Alex shares his journey in data and how working with the NSA impacted his career in the space. The conversation also covers synthetic data, the evolution of machine learning models, the boundaries between synthetic data and prediction models, and more.
Highlights from this week’s conversation include:
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Eric Dodds 00:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. Welcome back to The Data Stack Show. Today, we are going to talk with Alex, the Chief Product Officer at Gretel (gretel.ai). We have actually been talking about them for a while, Kostas. It’s a really interesting company. They do a number of things, but the primary thing they talk about on their website is synthetic data. Today, we’re going to talk all about machine learning models, training synthetic data on real data, and all the interesting use cases. This sounds basic, but I want to ask Alex what their definition of synthetic data is. I mean, you can create synthetic data in a spreadsheet, in Excel, right? But their flavor of synthetic data is pretty specific, and I think it’s really powerful. So that’s what I’m going to ask: for him to define it in Gretel terms, if you will.
Kostas Pardalis 01:25
Yeah. And I want to get a little bit deeper into what it means to generate synthetic data from multiple data sets. How can we reason about things like accuracy? What kind of characteristics of the original data sets do we want to recreate? That’s something I’m super curious to learn more about. I think we have the right person to do that today, so let’s go and do it.
Eric Dodds 01:58
Let’s do it. Alex, welcome to The Data Stack Show. Thanks, Eric. Kostas and I have actually talked about Gretel and synthetic data for some time, so we’re just super excited to cover that topic on the show. It’s been on the list for a while. Let’s start where we always do: give us your background and what led you to Gretel.
Alex Watson 02:21
Yeah, sure, I’ll give the two-minute version of it here. I started my academic career in computer science, moved out to the East Coast right after September 11, and joined the NSA, where I worked for about seven years. Awesome experience. I got to dive in on early applications of machine learning, and also security, which has influenced my career quite a bit over the years since then. Around 2013, I moved out to San Diego, where I am now, and started my first company, called Harvest AI. We were helping large companies that were starting at the time to transition to SaaS applications like Google Suite, Office 365, Salesforce, AWS, things like that. We helped them identify where important data was inside their environment and protect it. It was a really cool experience. We built that for about two years. At that point, we were going out for a Series A raise, had some interest around acquisition, and actually got acquired by AWS. I went on to spend the next four years of my career at AWS as the General Manager, launching their first security service for AWS, which was our product at Harvest, called Macie. It’s a service that customers use today within the AWS world to identify and protect important data in the cloud. I’m happy to dive in on that process here, but I think it was both the incredible access to data that we had inside the walls at AWS, and also talking to customers and realizing how difficult it was for them to enable access to the sometimes really incredible datasets they had, to enable decisions inside their business, that really led to the initial thesis we have with Gretel and synthetic data today.
Eric Dodds 04:07
Very cool. So much to dive into there. Can I ask one question about working for the NSA? Working in intelligence-type stuff for the government, I think a lot of times, probably because of Hollywood, you have two views of it. It’s either extremely advanced and very scary, like Big Brother, or it’s, well, it’s the government and they move slowly, so maybe the technology isn’t quite as good. Where on that spectrum was the actual experience of working with the NSA, if you can tell us?
Alex Watson 04:42
I’d have to say it’s both. My first job was programming Cray supercomputers, actually, when I started, so I got a chance to work with cutting-edge, multimillion-dollar machines, which was really cool, and so was the scale at which they were working. The caliber of the people there is almost unlike any other place I’ve ever worked, which is also really incredible. But it’s also the government, so things don’t move quite as quickly as you would hope. But, yeah, a great learning experience.
Eric Dodds 05:14
Yeah, very cool. Thanks for indulging me. Okay, let’s talk about Harvest. You were going out for a Series A, and then you get sort of ingested into a company that provides maybe more data infrastructure than any other company in the world, right? What was that like? Because at the NSA you worked with data at a huge scale. Can you talk about that experience a little bit? And especially as Macie grew to be a very widely used product, what were some of the lessons that you learned working at AWS scale, providing a data service via AWS?
Alex Watson 05:56
Yeah, yeah. With scale, I think one of the things you learn really fast is that the details matter. That’s one thing that really stands out: the things that are okay to let slide when you only have a couple of customers, even large-scale customers, become really big issues when you have thousands of customers. And that’s really what we needed to prepare for. On cool experiences or things that I learned during my time there, I think dealing with the scale, like how do we do natural language processing (NLP) at the scale of terabytes or petabytes of data that customers have in the cloud, was really fascinating. I also think taking the type of single-tenant software that we’d written, which would run inside a VPC per customer, to a multi-tenant service that needed to support thousands to tens of thousands of customers in the first month was quite the experience. We had some pretty cool learnings during that process. One of the stories, maybe just to cover it really quickly, that really stood out to me and helped shape how I think about building software today: most people know that at AWS everything revolves around re:Invent and the New York Summit, those two launches, right? Those are the two times that you want to launch a service. We were hitting the ground running and had really good traction with a couple of customers, and we were getting ready to launch Macie to the world, the fully multi-tenant version of Macie. One of the challenges we ran into was that we did not have enough time to completely finish multi-tenancy before we launched. So our choice was either to delay six months and launch at re:Invent, or to launch at the New York Summit. We really wanted to launch, so how could we get there? What could we do?
And one of our product managers had a really ingenious idea and said, “What if we launched the whole back end as multi-tenant and launched the front end as single-tenant?” What that meant is that each customer would have their own unique box in the cloud running our complete user interface stack. And since it’s AWS, it’s never just one box: you have three zones per region, so you have high availability within there. So for each region, we needed three boxes per customer. We forecasted we would have about 6,000 customers at launch, somewhere around there. That meant that to launch on time, we needed to run 18,000 virtual machines just for the user interface for these customers that might sign up. It was such an incredible experience. It was wild. We almost broke CloudFormation doing a deployment at the time. I’m sure it can handle it quite easily now, but at the time that was pretty new. And we forecasted that if we could finish the multi-tenant version of the UI within 45 days and shut the single-tenant boxes down, we would actually have a pretty conservative amount of cost for running all these user interfaces. So that was one of my more wild experiences. 45 days into the launch, we were able to turn 18,000 machines into nine machines. I’m sure a data center collectively cooled down at that point, but that was a neat experience, and it went off without a hitch. It’s one of those things: just taking a step back and asking how you can do something when you’re trying to hit a deadline. Being data-driven in the decisions we made, we felt like we could get there, and we did. That was a really cool experience.
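The fleet arithmetic in this story can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the customer forecast, zone count, and final machine count come from the conversation, while the function name `ui_fleet_size` is a made-up illustration.

```python
# Back-of-the-envelope fleet sizing for the launch described above.
# The numbers are the ones quoted in the conversation; the function
# name is hypothetical.

def ui_fleet_size(customers: int, zones_per_region: int) -> int:
    """One UI box per customer in each availability zone of a region."""
    return customers * zones_per_region

single_tenant = ui_fleet_size(customers=6000, zones_per_region=3)
multi_tenant = 9  # machine count quoted once the multi-tenant UI shipped

print(single_tenant)                  # 18000
print(single_tenant // multi_tenant)  # 2000
```

The second number is the rough reduction factor: about 2,000 single-tenant boxes were replaced by each multi-tenant machine.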
Eric Dodds 09:29
Wow, what a story. What a great story that is so great.
Alex Watson 09:37
There was a lot of stress in there. It sounds less scary now.
Eric Dodds 09:42
I can only imagine, right? It’s the pendulum swinging between “this is going to be awesome, we can pull it off” and “are we completely crazy?” Totally. Well, tell us about Gretel. When did you decide to start it? And give us an overview of the problems that you solve.
Alex Watson 10:02
So we started Gretel with this thesis. The thesis was about how difficult a problem it is to enable access to data inside a business, as I saw running Macie and talking to our big customers who were trying to figure out where all their important data is in the cloud, protect it, and figure out if it’s exposed to the world. Usually, that revolves around privacy, right? It’s like a contract that you have with your customers for your brand, or it’s sometimes legally enforced with things like GDPR. And we had a feeling that the existing methods, like “build a wall around your data, build a perimeter, build a better perimeter or VPC,” are effective tools, but at some point they’re going to break, and that’s what leads to breaches happening. Our initial thesis for Gretel was: what if we could train a generative AI model, so very similar technology under the hood to what you see from OpenAI with the GPT models, on data instead of natural language text? What if we could get that model to recreate another data set that looks just like your sensitive data set, except it’s not based on actual people, objects, or things? What effect would that have on privacy? In theory, if you could pull it off, it wouldn’t matter if someone’s computer got lifted at Starbucks and it had a lot of sensitive information on it. And maybe we could unlock new ways to share data. It’s evolved quite a bit since then, and we’ve got a couple of use cases, but I think that’s still one of the primary ones we see today: how to address privacy, and how to essentially use these generative models to anonymize data.
Eric Dodds 11:49
Yeah, super interesting. You mentioned a few tools that companies turn to in order to mitigate concerns around privacy and security. You mentioned VPC, for example. Those seem to be pretty pervasive. I mean, those are sort of the default set. Would you agree with that? Is that the most common pattern that you see? I think perimeter is a great term, right? I mean, doing nothing is obviously not an option for many companies. But to your point, there are data breaches in the news every single week.
Alex Watson 12:24
Yeah. There are various levels. You see some customers keeping data within their own perimeter or their walls, their private cloud. You see other customers using the cloud who embrace technologies, which are awesome in my opinion, like VPCs and role-based access instead of passwords and things like that. So really good patterns all around. But access control still leaves the chance and the risk of raw data finding its way out. I would say I applaud the effort that a lot of companies put in, but making that work is really difficult. You start seeing permissions issues when you’re trying to set up a VPC or an S3 group, and often a developer just makes the change and says, “Hey, I’m going to do this real fast and see if it works, and I’ll fix it later,” and then they forget. You start seeing issues like that. There’s a whole new class of tools being built to address problems like that. But I’ve been in a security role long enough that you start to see the repeated patterns, and that repeated pattern of a better way to build a perimeter around data is one that sounds good, and it works in some cases, but it’s not a long-term answer. Sure.
Eric Dodds 13:42
Yeah. I mean, those are all best practices, for sure, but they don’t necessarily get to the root of the problem. One thing that would be helpful, I think, especially to give context to the rest of the conversation as we get more technical here: could you define synthetic data as Gretel sees it? Creating synthetic data is a concept that has been around for a very long time. I guess we could argue how far back in history, but especially as it relates to technology, people have been creating datasets synthetically for decades and decades. So, could you help orient us around the term, specifically as it relates to what Gretel does?
Alex Watson 14:33
Yeah. So I’ll start with a really broad definition. You were describing, you know, the 1970s, someone sitting at a DOS terminal or a Unix terminal writing up a CSV file of data that you would use to test your program. That’s synthetic data. So, broadly speaking, I would define synthetic data as a computer simulation or algorithm that can simulate real-world events, objects, activities, things like that. It could be a spreadsheet, it could be a mathematical formula, it could be a computer program that just spits out random temperatures, ages, things like that for people. It can be that simple. You also hear terms like fake data or mock data, which might make sense for testing a user interface or something like that, but you wouldn’t ever want to query that data or ask it questions. In the Gretel context, we use synthetic data to mean data generated by a set of deep learning algorithms. So, to use that analogy once again, it’s similar to OpenAI’s GPT models or ChatGPT, or Stable Diffusion for images. Essentially, we have models that learn to recreate data like what they’ve been trained on. You can either create another data set that looks just like it, once again with artificial people, places, things like that, or you can prompt the model to create a new class, or to boost the representation of a class in your dataset where you want to see more examples. We see a lot of that, too.
So maybe summarizing here: the deep learning approach allows you to use the data for a lot more use cases, whether you want to use that data to train machine learning models, to power data science use cases inside your business, or for information exchanges. We see a lot of this in the life sciences world, where you’ve got companies that are trying to broadly share research about COVID or genetic diseases or things like that while preserving privacy. So when we talk about synthetic data in the Gretel context, it’s data that can be used with the same quality and accuracy as the original data it was based on.
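For contrast with the deep learning approach, the simplest kind of synthetic data Alex mentions, a program that spits out random ages and temperatures, might look like the sketch below. The field names, value ranges, and the `mock_people` function are assumptions made up for illustration. Because every field is drawn independently, this kind of mock data is fine for exercising a user interface but carries no real-world relationships worth querying.

```python
import random

def mock_people(n, seed=0):
    """Generate n fake records with independently drawn fields (hypothetical schema)."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    return [
        {
            "age": rng.randint(18, 90),
            "temperature_f": round(rng.uniform(97.0, 100.0), 1),
        }
        for _ in range(n)
    ]

rows = mock_people(5)
for row in rows:
    print(row)
```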
Eric Dodds 16:49
Fascinating. Okay, can we talk about the similarities to ChatGPT? It has been making its way across Hacker News and Twitter and all over our internal Slack channels, with people doing interesting stuff with it. You have a lot of experience with natural language processing and algorithms that run on language. Could you explain the differences between that flavor of deep learning and running deep learning on data itself? That’s an interesting concept to consider in general, right? Can you explain the difference, and even the ergonomics, of how you would approach deep learning on data versus natural language? It seems like a very different paradigm, but it sounds like they’re actually pretty close.
Alex Watson 17:41
Yeah, they are. Maybe I’ll talk about the underlying technology first, and then talk about the interface for how people interact with it. The underlying technology uses a class of machine learning models called language models, or large language models, both for OpenAI and for what we’re doing at Gretel. That really came out of a realization on our part that a dataset is a language in its own right, one that makes sense to computers but is harder for humans to assimilate. Under the hood, the technology is very similar. We use a recent class of language models called transformers, which have a great ability to learn from wide collections of datasets and apply that to whatever context you’re asking about, so you can essentially augment your data with better examples. So I think the OpenAI GPT example is very close to what we do at Gretel. ChatGPT is a layer on top of GPT-3, and it has a slightly different mechanism that’s used for training. This is kind of wild to think about, and it’s hard to believe that it works at this scale, but under the hood, GPT-3, or Gretel at this time, is really predicting the next field. If you’ve got a movie review dataset, and you’ve got people who have consistently rated a movie at a certain level, and you have a new user being generated, the model is probably going to generate a rating within a certain range. So really, it’s just saying, “Okay, if I have all this information, what’s the next most logical thing for me to do?” ChatGPT put a layer on top of that and did two things that I think were really significant.
One, it essentially uses this concept of human-based reinforcement learning. Instead of just getting an algorithm that is the single best thing at predicting what the next token in a sequence is going to be, it takes a look at the whole result and asks, “Is this the result that I want as a human or not?” There are human labelers looking at it and saying, “I asked it to create a list of to-do items for today. Do these make sense to me?” A human reviewer at OpenAI will look at it and say, “That is the best answer,” and that gets fed back into the algorithm to come up with better results. So there are two things happening right now that I think are really significant. One, we’re orienting machine learning algorithms to give the responses that humans want to see, which is good, right? The robots won’t be rising up against us if we’re teaching them to do the things we want them to do.
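The “predicting the next field” idea from the movie review example can be illustrated with a toy model. The sketch below is a plain frequency counter, not a transformer and not Gretel’s implementation: it counts which ratings follow which movie and samples new records from those counts. The data, movie names, and function names are all made up for illustration.

```python
import random
from collections import Counter, defaultdict

# Hypothetical training rows: (movie, rating)
reviews = [
    ("Movie A", 5), ("Movie A", 5), ("Movie A", 4),
    ("Movie B", 2), ("Movie B", 1), ("Movie B", 2),
]

def fit_next_field(rows):
    """Count how often each rating follows each movie title."""
    counts = defaultdict(Counter)
    for movie, rating in rows:
        counts[movie][rating] += 1
    return counts

def generate_record(counts, rng):
    """Pick a movie, then sample the next field (rating) conditioned on it."""
    movie = rng.choice(sorted(counts))
    ratings, weights = zip(*sorted(counts[movie].items()))
    rating = rng.choices(ratings, weights=weights, k=1)[0]
    return movie, rating

model = fit_next_field(reviews)
rng = random.Random(42)
synthetic = [generate_record(model, rng) for _ in range(1000)]
print(synthetic[:5])
```

Every generated rating for “Movie A” falls in the range seen in training (4 or 5), which is the behavior described above: a new record lands within the range the existing records imply. A real language model conditions on far more context than one preceding field.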
Eric Dodds 20:18
And we get a say in the uprising.
Alex Watson 20:24
Right. All it takes is one person to train it in a different way, but fortunately our training has gone in the right direction. So that part is really neat. The other part that I love, and that I’m really excited to see in the synthetic data world, is this natural language interface that’s used to talk to models. With GPT-2, if we rewind back one major GPT version, you would give it a couple of examples, say tweets or blogs, and it would create new tweets or blogs like what it was trained on. With ChatGPT, and increasingly with the other GPT models, you can just say, “Brainstorm a list of to-do topics for me to look at today,” or something like that. You have this natural language interface, similar to Stable Diffusion with images, where you can say, “I want a unicorn on a surfboard on Mars,” and it’ll generate it. So I’m really excited to see the way we interact with data becoming more based on natural language than on SQL queries or the data engineering that we all have to do to get that kind of answer right now.
Eric Dodds 21:28
Yeah, absolutely. Fascinating. All right, well, I’m going to stop myself, because I have a little wave of questions backed up. Kostas, please jump in here, because I know you have a ton of questions as well. Yeah.
Kostas Pardalis 21:43
Thank you, Eric. So Alex, let’s talk a little bit more about synthetic data. You mentioned it when Eric asked what synthetic data is, and I’d like to get a little more detailed on what it means to start from a dataset and generate more data that is synthetic, artificial, right, and yet similar, in that it shares the same properties. What are these properties? How should we think about that?
Alex Watson 22:21
A synthetic data set should behave like the original if you were to query it, right? If you were to build a dashboard off of this dataset, or just send it a SQL query, you would get a very similar response for an aggregate statistic. So “What is the average age of a person who likes to buy this product?” or “When do I have this spike in activity?”, things like that will be very similar between the synthetic data set and the real-world dataset. There are a couple of ways we measure this. You create your first synthetic data set and look at it and say, “That looks great, but I don’t know how it’s going to work for me.” That’s the first question we always hear from our users: “The dataset looks awesome. When I look at it, it looks fine. But I don’t know how accurate it is. How do I measure that?” So we have both unopinionated ways and opinionated ways to measure the quality or the accuracy of synthetic data. The unopinionated ways make the most sense when you’re just trying to create an artificial version of a data set and you don’t know how it’s going to be used. We don’t know what type of machine learning task it’s going to be used for, so we can’t measure that. To give you a couple of examples, we look at the correlations that exist between each pair of fields in the original data set and the correlations that exist in the synthetic data set. To go back to the movie review data set, if I knew that movies with Keanu Reeves usually had really high ratings, I would expect another movie in the synthetic data set with Keanu Reeves to have high ratings. It goes much deeper than just that two-level correlation, but that’s one of the first things we look at. Another really helpful tool we have is called field distribution stability.
This is a pretty common data science tactic. A lot of people will do this, and we just automate it. You might have a dataset that’s got 100 rows in it, or 100 columns. Essentially, we reduce it using something called PCA, principal component analysis, and plot it in a 2D plane. Then we look at the difference between the plots for the synthetic and the real-world dataset and ask, when we map this 56- or 100-column data set down to two dimensions, how similar do these distributions look? That gives you insight into whether the model is overfitting and just repeating a couple of things, or whether it’s capturing the whole distribution. The third thing is probably the most intuitive way to think about it, and it’s just looking at the per-field distributions. If you’ve got admission times for people in an EHR data set, or you’re looking at a financial data set where you have open, high, low, close type data, do the distributions of each one of those match what you’re expecting to see? We try to automate the whole process and give you a single score that helps you reason about how well the model is working.
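Two of the checks described above, comparing pairwise correlations and comparing per-field distributions, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Gretel’s scoring code: the function names, the toy `age` and `spend` fields, and the histogram-overlap measure are all hypothetical choices, and the PCA step is left out to keep the example dependency-free.

```python
import random

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def correlation_gap(real, synth, fields):
    """Largest absolute difference in pairwise correlation between two tables."""
    gaps = []
    for i, a in enumerate(fields):
        for b in fields[i + 1:]:
            gaps.append(abs(pearson(real[a], real[b]) - pearson(synth[a], synth[b])))
    return max(gaps)

def histogram_overlap(real_col, synth_col, bins=10):
    """Shared probability mass of two histograms on a common range (1.0 = identical)."""
    lo = min(min(real_col), min(synth_col))
    hi = max(max(real_col), max(synth_col))
    width = (hi - lo) / bins or 1.0

    def hist(col):
        counts = [0] * bins
        for v in col:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        return [c / len(col) for c in counts]

    return sum(min(p, q) for p, q in zip(hist(real_col), hist(synth_col)))

rng = random.Random(0)

def make_table(n):
    """Toy table where spend rises with age, plus noise."""
    age = [rng.uniform(20, 70) for _ in range(n)]
    spend = [a + rng.gauss(0, 5) for a in age]
    return {"age": age, "spend": spend}

real = make_table(2000)
good = make_table(2000)  # stand-in for a faithful synthetic table
bad = {"age": good["age"], "spend": [rng.uniform(0, 100) for _ in range(2000)]}

fields = ["age", "spend"]
print("good gap:", correlation_gap(real, good, fields))
print("bad gap:", correlation_gap(real, bad, fields))
print("age overlap:", histogram_overlap(real["age"], good["age"]))
```

The `bad` table keeps a plausible-looking `age` column but loses the relationship between age and spend, which is exactly the kind of problem that looks fine by eye and shows up as a large correlation gap.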
Kostas Pardalis 25:15
All right, so these are the unopinionated ways. What are the opinionated ways?
Alex Watson 25:20
The opinionated ways apply when you know how you’re going to use the dataset: for downstream classification training, for regression, or for forecasting. We see this quite a bit in the financial space, where you want to use time-series data to forecast what a stock price is going to be, things like that. When you know how you’re going to use it, you can actually run the synthetic data through the same downstream forecasting use case as the real-world data and compare the two. There are some really great tools out there that make this easy. There’s a framework in Python called PyCaret that a lot of our customers like to use quite a bit. Essentially, it simplifies the process of testing how your synthetic data works on classification tasks, question-answering tasks, and so on, versus the real-world data it’s based on.
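The opinionated comparison can be sketched as “train on synthetic, test on real.” The example below does not use PyCaret. It swaps in a tiny hand-written nearest-centroid classifier so it runs with only the standard library, and the data generator, the `shift` parameter, and all function names are assumptions for illustration.

```python
import random

def make_rows(n, rng, shift=0.0):
    """Two-class toy data: class 1 has a higher feature mean than class 0."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        rows.append((rng.gauss(2.0 * label + shift, 1.0), label))
    return rows

def fit_centroids(rows):
    """Nearest-centroid 'model': the mean feature value for each class."""
    sums, counts = {0: 0.0, 1: 0.0}, {0: 0, 1: 0}
    for x, y in rows:
        sums[y] += x
        counts[y] += 1
    return {y: sums[y] / counts[y] for y in sums}

def accuracy(centroids, rows):
    """Fraction of rows whose nearest centroid matches the true label."""
    hits = sum(
        1 for x, y in rows
        if min(centroids, key=lambda c: abs(x - centroids[c])) == y
    )
    return hits / len(rows)

rng = random.Random(7)
real_train = make_rows(1000, rng)
real_test = make_rows(1000, rng)                   # held-out real data
synthetic_train = make_rows(1000, rng, shift=0.1)  # stand-in for generated data

real_score = accuracy(fit_centroids(real_train), real_test)
synth_score = accuracy(fit_centroids(synthetic_train), real_test)
print(f"train on real: {real_score:.3f}, train on synthetic: {synth_score:.3f}")
```

Both models are scored on the same held-out real data, so a small gap between the two numbers suggests the synthetic data is usable for that specific downstream task.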
Kostas Pardalis 26:13
Okay, that’s super interesting. I can’t help myself, I have to ask: where is the boundary between synthetic data and prediction of the future? I’m asking because you mentioned financial models, for example, where you’re using synthetic data to run some models and do whatever you want to do there. And we also had the conversation earlier about ChatGPT and all these things that try to predict what the next part of the text should be. So prediction is part of the whole thing that we are doing here, right? What’s the boundary between trying to predict what is going to happen at some point in the future, and just creating data that shares some common characteristics with the original but doesn’t represent reality?
Alex Watson 27:24
I think machine learning models in general have a really hard time dealing with data that they’ve never seen before. When there is a market event, to go back to the financial world, that has never happened before, it’s unlikely that your machine learning model is going to be proficient at detecting it. That said, history repeats itself. One of the really popular use cases we see in the financial space is rare events, for example the GameStop event, where significant market changes happen due to something that has happened for the first time, or crypto market crashes, things like that. When you want to train your machine learning models to be good at detecting this and you can only pass them a single example, once again, they’re not going to do well. That is an area where I think synthetic data can really help today. You can give it an example and say, “Hey, look what happened with GameStop. I want you to create another 50 or 100 examples of something like that happening, so my model gets better at detecting it if it happens in the future.” Those examples are artificial. They’re based on real-world data, learning from what happened in that one example, and they’re not perfect, but in many cases that actually really helps. That’s one of the neat patterns we’re starting to see on our journey building synthetic data. We’re in year three now at Gretel. The first year was, “Does it work on my dataset?” The next year was, “Okay, but how does it work against my real data?” And now we’re starting to see this tipping point where people are realizing that machine learning models are data hungry, and there will always be classes that you’re not good at.
So this idea of augmenting your real-world dataset with additional synthetic examples, perhaps from a model trained on public data (has the world seen this before, and can I incorporate some of that knowledge into my own dataset?), helps you build a better data set that can have better accuracy than your own data would have all by itself.
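The “create another 50 or 100 examples” idea can be illustrated with a deliberately crude stand-in. The sketch below rescales and jitters one rare event to produce variants. A generative model would learn far richer variation, so this is an assumption-laden toy: the `spike` values, the noise level, and the `augment_rare_event` function are all made up.

```python
import random

def augment_rare_event(series, n_variants, rng, noise=0.05):
    """Create noisy, rescaled copies of one rare event (crude stand-in for a generative model)."""
    variants = []
    for _ in range(n_variants):
        scale = rng.uniform(0.8, 1.2)  # overall size of the event varies
        variants.append([v * scale * (1 + rng.gauss(0, noise)) for v in series])
    return variants

spike = [10, 11, 12, 40, 95, 60, 30]  # hypothetical price path for a single rare event
rng = random.Random(1)
extra = augment_rare_event(spike, n_variants=50, rng=rng)
training_set = [spike] + extra        # one real example plus 50 artificial ones

print(len(training_set))
print([round(v, 1) for v in extra[0]])
```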
Kostas Pardalis 29:36
That makes sense. All right, so in talking about data, what are the types of data we are talking about here? We can have a synthetic picture, we can have synthetic audio files, we can have synthetic rows in a database. So what are the most common use cases that you see out there where synthetic data is important today? I ask because you mentioned the natural language processing part of this many times, which is probably more textual, but we also have things like time series data, and structured versus unstructured data. So, yeah, tell me more about that. Yeah.
Alex Watson 30:27
I’m going to come across as probably a little biased here, because I would say Gretel would be one of the leading companies, if not the leading company, in working with tabular formats of data. That’s really where we got our start, and that’s what we built on. That said, our vision for synthetic data is much bigger than one type of data. So maybe I’ll talk about the types of data that we see being used for synthetics quite often, and I can even give some examples for the different types, but
Kostas Pardalis 30:56
you have tabular data to
Alex Watson 30:59
start out. So that’s the stuff that you have in, say, a data warehouse, a database, things like that. Then there’s time series data, which feels like a niche category of tabular data until you realize that in something like 50% of the world’s data sets, time is such an important component that it’s one we actually treat differently. Then text, so natural language text in different languages. And image synthetics are really big, right? People are using images quite often to train models for self-driving cars, or to recognize problems in a manufacturing line, or things like that, so there are a lot of use cases around that. And I’d say we’re increasingly getting into video and audio. Some of the new technologies that came out recently, like Stable Diffusion, are really showing the ability to create new variations or artificial versions of images and videos. Let’s say you’re an insurance company, and you’re trying to build something to give somebody a better insurance quote for their house. You want to look at the quality of the materials they have, and whether it looks like they have fire extinguishers and things like that around the house, just from a set of pictures. You never have enough data to start with. So there’s this idea of augmenting images: you might have a room, and you want to see that room with really fancy furniture, or with something more like what a college student might have. We’re seeing a lot of use cases there. And maybe the last place to touch on would be the simulation space. Today we’ve talked a lot about generative models, machine learning models, neural networks that create new examples of things. But in parallel to that, there is a simulation space where you might use something like a computer game engine. Unity would be a good example of this.
NVIDIA has a neat product called Omniverse as well, where essentially they have created a 3D world using a game engine that you can use to create and test these different kinds of simulation-based outcomes.
Kostas Pardalis 33:02
Wow, that’s super interesting. Okay, let’s focus on tabular data. What do we mean by tabular data? Tabular data is
Alex Watson 33:11
any type of... I use the term loosely here; think of it as any type of structured or semi-structured data format. So it could be anything from a CSV file to formats like JSON, where you don’t necessarily have the same level of structure but you can have arbitrary levels of nesting, to more advanced data formats like Parquet that are really efficient at encoding large amounts of data, or just data that’s inside a database or data warehouse.
Kostas Pardalis 33:37
Okay. And when we’re talking about creating synthetic data here, what do people usually do? Is the most common approach that you see out there something like: okay, I have, let’s say, a users table with 1 million users, and I’d like to have 2 million users with similar characteristics, where the distribution of the users is similar in terms of age or geography or whatever kind of information we already capture in this table? Is that the most common use case that you see out there? Or are people actually coming in and saying: okay, that’s my database here, right? I have users, and the users have, I don’t know, products that they purchased at some point. And I also have, let’s say, my inventory. Let’s say we represent the whole domain of what the company or the user is dealing with, which can be quite complex, right? And they want to synthesize the whole database, that is, the other
Alex Watson 34:50
relational components. So not just capturing the relationships that are inside a single table like your users table, but capturing relationships between the users table and the inventory table, is a really cool challenge in the synthetic data space. Yeah, and a popular one. So, to answer your question, when we have users come in and use our platform, often, especially if you’re doing pre-production testing (a really big use case that we haven’t talked about as much yet), you are trying to build a version of your production environment that you might use inside a development or staging environment. You don’t want to have real-world data, but you want it to reflect what’s happening in your production system. This allows any of your developers to investigate different records without worrying about privacy or things getting compromised or anything like that. So in this use case, we often have customers that will create a twin version, a dev, staging, or test version of a production database. Essentially, they’ll queue it up. Depending on how recent they need to keep it, once an hour or once a day, they’ll run the job, which will bring in all the new records or records that have changed, train the model on that data, and essentially create another database that sits inside your test or staging environment. The really neat thing is that it’s not just the database you’re getting here; you’re getting a model. And this model can be used to subset that data. So if you have, you know, 2 billion records inside your production database, and you can’t run that in your dev or staging environment without having insane DynamoDB costs, you can create a smaller data set that captures as many variations as possible. So it’s much more efficient than just taking a slice of that data set. It’s more representative of it. Or, and I think this is another really neat use case, scale testing. Yep.
You want to test the ability of your application to handle 10 or 100 times the amount of data you might encounter on a typical day. Without just repeating the same records over and over again, you can use that same model to generate new variations of the data that you can use for the test.
Kostas Pardalis 37:00
That’s super interesting. And how does this process take shape in terms of the developer experience? Let’s say I’m a developer, I have my database today, and I want to go and test the limits of my production environment, right? What am I going to be doing, and how am I going to be using Gretel to do that? Yeah.
Alex Watson 37:26
So this process with Gretel is two stages. You’ve got your production database; let’s say it’s a Postgres database or an Atlas database, hosted or not hosted, it really doesn’t matter. You want to create a version of it for your lower, pre-production environment. There are two steps you need to do, because you don’t want the synthetic data model to memorize important information, like customer names, customer IDs, and things like that. So we have two steps. These are both powered by a cloud API. You can run in the cloud; sometimes customers have really sensitive data requirements and need to run inside their own cloud, so you can also deploy these workers as containers to your own cloud. But the two steps are: one, scan and use NLP to identify sensitive data, for example customer IDs, names, things like that. From there, you have a policy that says, whenever I see this, I’m going to redact it, I’m going to replace it with a fake version of it, I’m going to encrypt it in place, or whatever your company feels is appropriate. Often we see people using fake data. If it is just a name, for example my user name, I might replace it with another artificial name. Yep, just to make sure the model doesn’t learn it. You have a risk there that even when you do that traditional de-identification, the other attributes of your dataset that by themselves aren’t identifying, for example my age or my location (a lot of times in advertising data, the precise location will put you right at somebody’s house), become very identifying when you put them together. And the real power of these synthetic models is that they will create artificial versions of those things. So you remove or replace the names inside your data set, and you create new artificial locations, shopping cart activity, whatever you have inside your data set.
So the second stage is the synthetics step, where you train a model, and then you tell that model, I want to generate 10 times as much data, or I want one-fifth as much data, and you take the outputs and essentially put them right back into your database. So you create a twin database that you can use for testing.
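Editor’s note: the first stage Alex describes (detect sensitive fields, then apply a per-entity policy such as redact or fake) can be sketched roughly as below. This is a minimal illustration, not Gretel’s implementation or API. The function names, the policy format, and the fake-name list are all hypothetical, and a simple regular expression stands in for the NLP-based detection mentioned in the conversation.

```python
import random
import re

# Hypothetical stand-in for NLP entity detection: a regex that matches email addresses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical pool of artificial names used by the "fake" action.
FAKE_NAMES = ["Jordan Lee", "Sam Rivera", "Casey Morgan", "Riley Chen"]


def detect_sensitive_fields(record, name_fields=("name",)):
    """Return {field: entity_type} for fields that look sensitive."""
    found = {}
    for field, value in record.items():
        if field in name_fields:
            found[field] = "person_name"
        elif isinstance(value, str) and EMAIL_RE.fullmatch(value):
            found[field] = "email"
    return found


def apply_policy(record, policy, rng):
    """Return a copy of record with each detected entity handled per the policy.

    policy maps an entity type to "fake" or "redact"; unknown entities are redacted.
    """
    cleaned = dict(record)
    for field, entity in detect_sensitive_fields(record).items():
        action = policy.get(entity, "redact")
        if action == "fake" and entity == "person_name":
            cleaned[field] = rng.choice(FAKE_NAMES)
        elif action == "fake" and entity == "email":
            cleaned[field] = f"user{rng.randrange(10**6)}@example.com"
        else:
            cleaned[field] = "[REDACTED]"
    return cleaned


if __name__ == "__main__":
    rng = random.Random(0)
    row = {"name": "Pat Example", "email": "pat@corp.example", "age": 40}
    print(apply_policy(row, {"person_name": "fake", "email": "redact"}, rng))
```

Note that `age` passes through untouched. That is exactly the gap Alex points out: quasi-identifiers such as age and location survive this kind of de-identification, which is why the second stage trains a generative model instead of copying those columns.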
Kostas Pardalis 39:33
Okay, and how long does that process take? How long does it take to train this model?
Alex Watson 39:41
That varies, and it varies based on what your use case is. We have this kind of belief that there’s no one machine learning model to rule them all, and each one has different advantages. So if you are going after a machine learning use case and you care about accuracy, you’d want to use a deep learning generative model like our language model, which gives you the best performance of anything. Alternatively, you have GANs, generative adversarial networks, which don’t offer quite the performance of our language models, but they’re pretty fast after training. And working with customers that have tremendous scale they need to run at, we built statistical models. These are based on copulas. So instead of using a deep learning technique, they use a really neat mathematical technique to learn and recreate distributions in data. So, essentially, it’s based on your use case. Do I care about accuracy, and am I training a machine learning model on this? Use a generative algorithm that might take an hour to five or six hours to train, depending on the size of the data set. If you want speed, and you want to generate data at 100 megs per second so you can create 40 billion records to test your dataset, that’s where I really suggest using what we call our Amplify model, the statistical model. On a 32-core machine, we’ve clocked it at about 100 megs per second that it can generate. So if you’re generating billions of records, it’s entirely possible to do that within a day, instead of having to wait a month for a model to do it.
Kostas Pardalis 41:15
Yeah, that makes total sense. And that’s very interesting, actually: there is this trade-off between, let’s say, fidelity and the time that you spend training the model, right? Again, going back to what it means to represent with accuracy the characteristics of the data that you have initially, what does this mean from the user’s perspective? How can I reason as a user about that stuff? Because on a high level it’s easy to understand, but I think that when you start working with a real example, and you have your own data there, it gets much harder to figure out how you can reason about these things.
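Editor’s note: the copula-based statistical approach Alex mentions (learn each column’s distribution and the dependence between columns, then sample new rows) can be sketched as a two-column Gaussian copula. This is an illustrative sketch using only the Python standard library. It is not the Amplify model, and all function names are hypothetical. It assumes each column holds at least two distinct numeric values, and it reuses observed values for each column, where a production system would typically fit or interpolate the marginals.

```python
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sigma 1


def normal_scores(values):
    """Map each value to a standard-normal score based on its rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, idx in enumerate(order):
        # (rank + 0.5) / n keeps the probability strictly inside (0, 1).
        scores[idx] = _STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def empirical_quantile(sorted_values, u):
    """Look up the value at probability u in an already-sorted column."""
    idx = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]


def sample_gaussian_copula(col_a, col_b, n_samples, seed=0):
    """Generate n_samples synthetic (a, b) pairs that preserve the columns' dependence."""
    rng = random.Random(seed)
    # 1. Learn the dependence structure in normal-score space.
    rho = pearson(normal_scores(col_a), normal_scores(col_b))
    rho = max(-1.0, min(1.0, rho))  # guard against floating-point drift
    sorted_a, sorted_b = sorted(col_a), sorted(col_b)
    rows = []
    for _ in range(n_samples):
        # 2. Draw a correlated pair of standard normals.
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
        # 3. Map back through the normal CDF to each column's empirical distribution.
        a = empirical_quantile(sorted_a, _STD_NORMAL.cdf(z1))
        b = empirical_quantile(sorted_b, _STD_NORMAL.cdf(z2))
        rows.append((a, b))
    return rows
```

Because fitting amounts to sorting and one correlation estimate, and sampling is a few arithmetic operations per row, this family of models trades some fidelity for speed, which is the trade-off discussed in the conversation. The sketch makes no privacy guarantee on its own.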
Alex Watson 42:10
Often it depends, at the end of the day, on the domain and the use case you’re going after. We see a lot of repeated domains that we talk to; we’ve got a Discord channel where we talk about things, from life sciences to things that we see in the advertising space, another area that’s really picking up on synthetic data. So we can reason about those. I’d say between our top-end, most capable language models and GANs versus our statistical models, you’ll see about a 10% decrease in accuracy. So were you to train a classifier, the downstream model trained on data from the statistical method would be about 10% less accurate on average. And one of the things I can link for you guys after the show: we run all of our models against about 50 different datasets and then compare the results and the accuracy of each one. You can see how the state-of-the-art language model performs versus a state-of-the-art GAN versus the statistical model, and make your own decision there. We’re also realizing that so many of our users don’t have time to make this decision. So we’re introducing these things called auto params, which are on by default with many systems now. It just looks at the size of the data set and asks, are you trying to generate, to use that example again, 40 billion records? If you are, it will just pick the right algorithm to do this for you. Increasingly, I think our vision is that six months from now, people don’t have to worry about what model to choose for a use case; we just pick it based on what we’ve observed with the data.
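Editor’s note: a rough sketch of the kind of decision Alex describes, together with the arithmetic behind "billions of records within a day" at about 100 MB per second. The function names, the one-billion-record threshold, the model labels, and the 200-bytes-per-record figure are all assumptions made for illustration. They are not Gretel’s auto params logic or published numbers.

```python
def generation_hours(n_records, bytes_per_record, throughput_mb_per_s=100.0):
    """Estimate hours needed to generate n_records at a given throughput (1 MB = 10**6 bytes)."""
    total_bytes = n_records * bytes_per_record
    seconds = total_bytes / (throughput_mb_per_s * 1_000_000)
    return seconds / 3600


def pick_model(n_records_to_generate, accuracy_matters,
               large_scale_threshold=1_000_000_000):
    """Hypothetical heuristic mirroring the trade-offs described in the conversation.

    - Very large generation jobs: use the fast statistical (copula-based) model.
    - Otherwise, if downstream ML accuracy matters: use the deep generative model.
    - Otherwise: use a GAN, described as fast once trained.
    """
    if n_records_to_generate >= large_scale_threshold:
        return "statistical"
    return "deep_generative" if accuracy_matters else "gan"


if __name__ == "__main__":
    # Assuming 200 bytes per record (an illustrative figure, not from the episode).
    hours = generation_hours(40_000_000_000, 200)
    print(f"{hours:.1f} hours for 40 billion records")
    print(pick_model(40_000_000_000, accuracy_matters=True))
```

Under those assumptions, 40 billion records come to 8 TB, which at 100 MB per second takes a little over 22 hours, consistent with the "within a day" estimate for the statistical model.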
Kostas Pardalis 43:52
Okay. All right. And one last question from me, because we’re getting closer to the buzzer here, as Eric usually says, and I want to give him some time to ask any questions that he has. If I’m new to the synthetic data world, where should I look to learn more and play around with the technologies or tools or anything else that exists out there right now?
Alex Watson 44:31
Yeah, so in this world, I would, of course, first recommend starting with Gretel. So just a quick thing on that, and then I’ll mention a couple of other platforms to check out as well. Our underlying models and code are all open source, gretel-synthetics on GitHub, so you can see how they work. You can introspect how we do privacy, things like that. Our service has a free tier, so all you need is a Gmail or GitHub account to sign up, and we have example data sets. We have these low-code interfaces where you can just say, I’m trying to balance a dataset, I’m trying to classify a dataset, or I’m trying to create a synthetic version of the CSV that I have. You don’t have to write a single line of code; you can do it yourself. And that’s where I always recommend starting, because it just makes so much more sense after you’ve tried it. So that part is free. I would also definitely recommend OpenAI, which has a really great playground for ChatGPT and the OpenAI GPT models. Just trying some prompts, or sending some data in and telling it to summarize something for you or create a list of something, gives you a feel for where models are today and where they’re going. And there are other things as well. Awesome. Thank you. So
Kostas Pardalis 45:40
Eric, the microphone is yours again.
Eric Dodds 45:46
Oh, wow. I feel so empowered. Yeah, this has been such a fascinating conversation. Alex, I want to pick your brain here in the last couple of minutes on your thoughts on the impact that these technologies will have. As we think about Gretel, one example we talked about as we were prepping for the show is hospitals being able to share records around a particular disease in order to help researchers and medical professionals solve a problem, and help treat that disease or even cure the disease, which is really incredible. And then you have, I would say, things that are in a little bit more of a gray area, with Stable Diffusion or even ChatGPT, where the uses can vary widely and can even include things that people would consider unethical, depending on what you’re talking about. And you have really deep experience. I was thinking about this as Kostas was talking and you were explaining a lot of this stuff: you have experience with intelligence, the government, building AI technologies, and solving privacy problems. Maybe a good way to frame my question would be, do you think about stewardship of these deep learning models, and of artificial intelligence in general? And if so, what are the things that are top of mind for you as we break new ground with deep learning and the ability to produce all these novel outputs? Yeah.
Alex Watson 47:38
Great question. Maybe there are two parts to the question. Where could this be transformational, or what are we going to see across these different technologies, both for sharing data and for creating data? And then, what are the ethical implications we need to think about around that, and where is this going? There is a very good chance, and I’ll back this up in a second but let me start with something a little more bold, that this will be the biggest innovation that has happened since cloud computing. The reason I believe so is that these models give you the ability to distill and disseminate information or intelligence in a way that has never been possible before, right? A natural language interface that you can query. So I think that’s huge, and we’re just starting to see the use cases for it. Speaking of the data sharing use case, for example with life sciences institutions: data-driven healthcare and medicine is, as anyone in the space would say, the biggest potential for improving health that they can see. The biggest limitation that they have is that often that data is siloed within a particular region. Sure. If you’re trying to create a cure that’s going to work for people across the world, but you only have access to one demographic, for example the UK Biobank, how do you know that you haven’t just found a signal that is specific to that population, rather than something that will work everywhere? So the power of this, and we did a really cool study with Illumina working on genomic data, was showing that we could in fact synthesize one of the most complex types of data sets that has ever been created. Sure, we started with mice, which was kind of funny. But even the mice had about 100,000 columns of attributes.
What we showed is that we could synthesize that dataset, create a totally artificial version of it, and then recreate the results of a popular research paper that had been produced using that data, which was cool. So there is a lot of work left to be done there, both on the sheer scale of human genomic data and on the privacy. But the potential there is that a researcher anywhere in the world who had an idea on how to cure a rare disease could test that against every hospital in the world, which would just be... So, a really exciting example there. On ChatGPT and the OpenAI approach, I hear a lot, particularly about Stable Diffusion, that it is just for creative use cases, just for messing around. I would challenge that and say that’s just where it is today; it’s not going to be there for long. What I think is missing right now is the confidence that the model is going to output what you’re looking for. You could say, generate a picture of me standing on a mountain drinking a coffee or something like that, right? Maybe the first time it’ll do it; the second time or third time, it won’t. And in the data world, in the conversations we have with our users, there are tons of applications for training machine learning models based on being able to generate new images, but you have to have confidence that what the model is outputting meets your expectations. So I think that’s going to be the next big thing there. But I do think that these models are going to be everywhere, in one way or another. Whether it’s creating more training data for models, or automatically summarizing a meeting for you at the end, or things like that, you’re going to see these models quite a bit.
And the last part, on ethics and stewardship and where this goes: it’s an interesting question, particularly how you phrased it with the background around intelligence and things like that. When technologies get created, they will inevitably, at some level, be abused. That will happen. So I would personally vector a lot more towards openness, and relying on society to solve these problems together, than towards trying to control it and then essentially just creating a small set of governments and rich companies that have access to this technology. So I really applaud the open source movement here: open source publishing, research, and things like that. That approach, I think, is working well. Things, I think, historically have gotten more problematic when that gets closed off or limited, and then you don’t have the ability for a community to look at something and give you an opinion on whether it’s ethically correct, or whether we should do something about it.
Eric Dodds 52:16
Sure. Such insightful answers. I will be considering these things definitely for the rest of this week, and probably long after. Alex, this has been an unbelievably thought-provoking show, and we’ve learned a ton. So thank you so much. Appreciate it. I think my big takeaway from this show, Kostas, is that Alex is, number one, so approachable as a person, but number two, has such a variety of deep experience in the space, from government intelligence to startups to delivering things at scale on a crazy timeline inside of AWS. And so I just grew more and more to respect his opinion throughout the show, which made his final thoughts on where these types of deep learning technologies are going even more poignant for me. And I really agree with him. I think it was a really fresh, honest take not to say, well, you shouldn’t use it for this, or you should use it for this. He acknowledged outright that these new technologies are always used in ways that humanity probably shouldn’t use them, and that doing things in the open is a really healthy antidote to that. And so I really appreciated his perspective on that. It sounded simple, but I think it was very powerful, and it’s something that I’ll definitely keep from the show.
Kostas Pardalis 53:57
Yeah, 100%, I totally agree with that. I mean, I think in the end, especially when you’re talking about technologies, or knowledge in general, that are, I don’t know, changing the very foundations of the way that we operate as humans, yeah, it might be scary. Obviously, we can make mistakes and use technology in the wrong way. But in the end, that’s how humans manage to make progress, right? That’s not something that we can change. And I don’t think that there’s that much value, in the end, in not taking the risk of having access to these new tools or this new knowledge. And again, the best way to protect humanity is to make these things available to everyone. So I totally agree. I think we’re going to hear more about these technologies. And okay, there’s also a lot of, let’s say, hype right now, and we’re still just scratching the surface of what can be done with these technologies. But I have a feeling that in the next couple of months, we will see much more practical and interesting uses of these technologies. And we’ll have more people on the show to talk about that stuff too.
Eric Dodds 55:23
Absolutely. It was a great episode, and we want to have them back on. But like we said earlier, when we were wrapping up the year, we wanted to talk more about some of these emerging technologies like ChatGPT and Gretel that are forging new ground. So thank you for joining us. Subscribe if you haven’t, tell a friend, and we’ll catch you on the next one. We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.