Episode 145:

What is Synthetic Data? Featuring Omar Maher of Parallel Domain

July 5, 2023

This week on The Data Stack Show, Eric and Kostas chat with Omar Maher, the Director of Product Marketing at Parallel Domain. During the episode, the group discusses synthetic data in the context of computer vision and autonomous vehicle development. Omar shares his background in data and machine learning and explains how synthetic data can be used to generate labeled data that is fresh, clean, and useful for training and testing machine learning models. The conversation also includes the challenges of obtaining high-quality labeled data for computer vision projects, the importance of addressing edge cases, ethical implications of using synthetic data to train AI models, and more.

Notes:

Highlights from this week’s conversation include:

  • Omar’s Journey into Machine Learning and Current Work at Parallel Domain (3:25)
  • Interest in Data Analytics (6:27)
  • Challenges with Labeled Data (8:02)
  • Introduction to Synthetic Data (11:27)
  • Challenges with Real World Data (16:28)
  • Parallel Domain’s Background (19:44)
  • Improving Machine Learning Models with Synthetic Data (21:41)
  • Using Synthetic Data to Improve Performance (24:56)
  • Combining Synthetic and Real Data (27:34)
  • Pipeline for Synthetic Data Generation (29:46)
  • Simulating Realistic Environments and Sensors (32:44)
  • Building a Realistic Simulated World (35:48)
  • Complexity of Synthetic Data for Machine Learning (38:36)
  • Advancements in Gaming Industry and AI (42:27)
  • Synthetic Data Across Different Domains (46:03)
  • Ethical Implications of Synthetic Data (48:34)
  • Final Thoughts and Takeaways (52:29)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 00:03
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com. Welcome back to The Data Stack Show. Kostas, we have a really exciting subject, synthetic data, but in an even more exciting context, which is imagery, video, and self-driving cars. So Omar from Parallel Domain is going to, I hope, I’m confident he’s going to teach us so much about synthetic data. And I think we’re going to just learn a ton about self-driving cars and what it takes to even get, you know, training data and go through that entire process. I’m really interested in synthetic data in general. I think we’ve had one other guest on to talk about it, actually; we’ve done very few episodes on synthetic data. And Parallel Domain is pretty specific, right? They’re pretty opinionated on the type of data that they work with. And it’s like the most extreme type, right? You’re talking about imagery, you’re talking about, you know, labeling that is geometric. I mean, it’s crazy. So, I guess on a personal level, I want to know what attracted Omar to that sort of difficult problem. So that’s what I’m going to ask him about. How about you?

Kostas Pardalis 01:43
Yeah, you’re right. I think it’s the second time we’ve had synthetic data on the show, and the first one, if I remember correctly, was more about tabular data, right? Here, it’s going to be more like visual data. I mean, I have so many questions, to be honest. I want to understand the differentiation between these kinds of data: what does it mean to have synthetic visual data compared to synthetic tabular data? What does it mean to simulate this? What are the types of labels that you use, and why, and all that stuff? And I want to see what the relationship is between whatever Parallel Domain is doing and things like 3D graphics and computer games, right? Because since we had the first computers, we’ve been trying to simulate reality, in a way. So there’s a lot of overlap between so many different domains, and I’d love to hear from him how much of an overlap we have there, what that overlap means in terms of moving knowledge from one domain to the other, and what this means for the future of this very interesting technology. So let’s dive in and have this conversation with Omar.

Eric Dodds 03:17
Let’s do it. Omar, welcome to The Data Stack Show. So great to have you.

Omar Maher 03:22
Eric, pleasure to be here. Thank you for hosting me.

Eric Dodds 03:26
Omar, you have a rich background in all sorts of machine learning and really interesting AI use cases. So give us your journey. How did you get into data, and then what led you to Parallel Domain?

Omar Maher 03:43
Awesome. So I started playing with data when I was in college, where I was interning at a major software company in Egypt, where I’m from originally. I was applying for a web dev kind of role for the internship, and then I met one of my best friends now, who inspired me about, you know, business intelligence, data warehousing, data engineering, stuff like that. And my journey began from there. That was, what, 13 years ago, 14 years, something like that. So I started with business intelligence and data warehousing, and then I moved to data mining. Once I graduated, I co-founded two technical startups that used machine learning heavily for personalized recommendations. We were building some sort of, like, the Yelp of Egypt, you know, social reviews and stuff like that. And then from there, I started assuming, you know, Director slash VP roles in multiple companies for advanced analytics and machine learning. So I worked for some time in Dubai, and then I moved to the United States to build a machine learning team at a company called Esri, the world leader in geospatial analytics. And it was super fun. We put together a team of, you know, AI experts, data engineers, data scientists, and we worked with a lot of customers in the United States and other places, public and private sector, building machine learning solutions, building products using machine learning, et cetera. And then I recently joined Parallel Domain to focus on synthetic data, which is a very exciting area, because I’ve experienced the pain of not having good quality data throughout my life. So, about 13, 14 years of playing with data, mostly machine learning related, working with customers in different places and building products around it.

Eric Dodds 05:26
Awesome. This is rewinding in history a bit, but do you remember that friend who sort of got you into business intelligence and using databases to do analytics? Do you remember the first thing he showed you that made you think, oh, wow, I have to dig into this?

Omar Maher 05:47
He showed me a dashboard. And actually, that’s funny, because it’s the first time I’ve thought back on that moment, so thank you for reminding me of this beautiful moment. I literally remember sitting, you know, at his desk, and he was sharing the news with me: hey, Omar, we don’t really have openings for web dev, but we have a super interesting thing on the side that I’m doing that I’d love to have you work on, BI and data warehousing. And he started showing me, you know, this beautiful dashboard. I think it was some sort of Microsoft technology or something. And then he started talking about, you know, the process behind that dashboard, what they were doing to clean the data and store it in a data warehouse, et cetera. So yeah, that was the moment.

Eric Dodds 06:27
Interesting, maybe Power BI or something. Since you were interested in web dev, did it actually feel a little bit contiguous? Because you write code on the back end, and then you’re displaying it, or getting some sort of visual experience.

Omar Maher 06:43
It was an interesting moment, because I had spent almost a year preparing myself to become an expert web dev. You know, I took courses, I read a lot of things, I built different systems, for my gym, for example, the one I was going to, et cetera. But the twist of having a look at the world of data analytics and, you know, data warehouses at that point was interesting, because I started seeing the kind of business value that this can drive, right? Because you can build web applications that people can use, and this is going to generate data. But what are you going to do after that? Unless you do something to make that data accessible, fresh, clean, useful, it’s not going to be of much use. And that’s kind of the art and science of data analytics, which got me eventually into data mining and stuff. So I think the pivot was nice. Yeah.

Eric Dodds 07:39
Yeah. That’s super cool. Okay, I want to talk about synthetic data, but can we talk about the problem first? So maybe, could you describe for us a project that you were working on where you needed to train a model, and it was just so painful because you didn’t have the raw material that you needed?

Omar Maher 08:02
Yeah, that’s easy, because I can reference literally like 30, 40 projects that didn’t start. No, literally. If I just take the last five or six years of my life here in the States, working as a Director of AI at Esri, I think at least 50% of the computer vision projects. So we were using satellite imagery and drone imagery, for example, to help our customers extract intelligence out of that, right? Everything from detecting building footprints, to assessing damage after hurricanes, to working with insurance to quantify the impact of storms on houses, to working with agriculture on understanding crop health and assessing crop growth and stuff like that. All of these projects were computer vision related, and we needed high quality labeled data to train the deep learning models to make these detections. And unfortunately, at least half of those projects didn’t start, because we didn’t have that good quality labeled data. So customers, for example, would have very little labeled data. And by labeled data, obviously, I mean they have the imagery and they have the labels, which are like the polygons around the houses, or the labels for the crops, whether they’re healthy, et cetera. So either they would have zero labeled data, or they’d have some, but not enough. And even when we showed them how to label data themselves, like the tools, they either didn’t have the time, or the expertise, or the workflow to support it. Unfortunately, at least 50% of these projects didn’t start, in agriculture and national government and insurance, even in retail, because of that reason. And that’s why I was so excited to join a company that is doing synthetic data, because I’ve been living in this pain for at least half a decade of my life.

Eric Dodds 09:58
Yeah. Okay. I do have a business question, though. So you’re working with satellite imagery at a multi-billion-dollar company. Is there not a revenue line to, like, launch some satellites and start taking those images, sort of, you know, almost vertically integrating the data that you need? I mean, I don’t know anything about the regulations of launching a satellite into space, but it’s actually interesting to me: can you not just take more pictures from a satellite? I mean, I’m sure you can’t. But that’s interesting to me.

Omar Maher 10:35
Well, actually, here’s the thing. The good news is that we are in no shortage of pretty pictures, right? Like, we have a ton of satellite imagery already from the existing providers. It’s just the fact that we need to label those images.

Eric Dodds 10:48
Right. So it’s a labeling issue, not an images issue?

Omar Maher 10:52
Exactly. I mean, yeah, there is definitely a need to have, like, you know, high quality and recent images and stuff like that. But the good thing is that there are a lot of companies out there already doing that, and you can contract them, you can work with them, you can buy data from them, et cetera. The thing is having the needed labeled data on top of this imagery to do the workflows. That’s usually the bottleneck, right? And there are a lot of companies out there providing labeling services, et cetera. It’s just that when you have a customer interested in doing something, and we don’t have that data right now, that’s usually the problem, right?

Eric Dodds 11:27
Yeah, that makes total sense. Okay, so tell us about synthetic data. Actually, I’m just going to ask you about all of these pivotal moments, you know, throughout the last decade and a half. When did you first experience synthetic data, and what was the light bulb that went off where you said, I think this is actually the solution?

Omar Maher 11:50
Totally. So I used to hear about synthetic data early on, like, I think, four or five years ago. There were different attempts to use GANs, you know, generative adversarial networks, to come up with synthetic images and use those as input training data for, let’s say, geospatial work. So there were a bunch of companies out there doing these kinds of experiments: hey, we’re trying to detect this kind of vehicle, for example, from satellite imagery, but we don’t have a lot of data, so let’s go generate some synthetic data using GANs. The quality wasn’t great, to be honest; the results weren’t great. So, yeah, I heard about it, but didn’t dig in much. I think the pivotal moment, I remember it exactly, was when I was doing my daily exercise on the treadmill, and I was watching the Tesla AI Day, the first one, where they showed for the first time the simulation and synthetic data engines and products that they have to train the autonomous driving for Tesla. That was the first moment where I said, wow, this is amazing, because they showcased really impressive technology that can simulate, with high realism, different locations, different weather conditions, and different edge cases and long-tail scenarios that you want to train those self-driving cars on, you know, pedestrians jaywalking, for example, or cars or objects partially occluded, et cetera. And it was pretty realistic. And they spoke about the different components that go into this kind of synthetic data engine: the underlying simulation, the graphics, the weather parameters, et cetera. So that was the first time that I saw that, and it was pretty impressive. And I thought, wow, this is much needed in many different domains, not just self-driving cars.

Eric Dodds 13:40
Yeah, absolutely. Well, let’s step back just a minute. The word “synthetic” makes sense, right? You’re sort of creating something that isn’t naturally occurring. Like we hear about, you know, synthetic drugs or other things like that, where it’s like, okay, you’re combining different things to make something that doesn’t necessarily naturally occur in the real world. It makes sense that synthetic data follows along the same lines, but how would you define synthetic data? Do you have sort of a top-level definition of what demarcates synthetic data, or makes it categorically synthetic data, as opposed to, you know, a training dataset or something else?

Omar Maher 14:27
Absolutely. So I think, at its heart, it’s any data that is generated artificially by computers instead of being generated or captured in real life, right? That’s the bottom line. And there are different ways to generate this data. You can use classical ways, like procedural generation, for example, game engines, et cetera. Or you can use AI-based methods like GANs, or stable diffusion, or self-supervised learning, any of the kinds of techniques we hear about today. So there are different ways to generate this data, but the bottom line is, it’s data that’s artificially generated by computers, trying to be as realistic as possible relative to real-world data, instead of being explicitly captured in real life. So, a car with a camera collecting data out there, that’s real-life data. A form collecting input data that users are entering, that’s real-world data. Now, if you start using statistics to generate tabular data that looks similar to that, that’s synthetic. If you start using graphics and procedural generation and other AI-based methods to generate good-looking images that could be used for training machine learning models, that’s synthetic. Now, one thing to mention, though: usually, when synthetic data is mentioned, it’s mentioned in a machine learning context, as data suitable for training machine learning models, right? Theoretically, you can generate synthetic data to do different things with it. But usually, when it’s mentioned these days, it’s for machine learning purposes, to make machine learning models better, whether those are computer vision or not.
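
For the tabular case Omar mentions, “using statistics to generate data that looks similar,” a minimal sketch might fit simple statistics on real records and then sample artificial rows from them. Everything here (the column meanings, the numbers) is hypothetical:

```python
import numpy as np

# Hypothetical "real" records: rows of (age, income, purchases).
real = np.array([
    [34, 52_000, 3],
    [29, 48_500, 5],
    [41, 61_200, 2],
    [37, 58_000, 4],
], dtype=float)

# Fit simple statistics: per-column means and the covariance matrix.
mu = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Sample synthetic rows from a multivariate normal with those statistics.
# Each row is artificial, but the joint distribution mimics the real data.
rng = np.random.default_rng(seed=0)
synthetic = rng.multivariate_normal(mu, cov, size=1000)
print(synthetic[:3])
```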

Eric Dodds 16:06
Yep, totally. Now, that’s what Parallel Domain does, right? So can you give us an overview? Parallel Domain lives entirely in the world of synthetic data. What does Parallel Domain do? What specific type of synthetic data do you specialize in? What do you create? How do your customers use it?

Omar Maher 16:27
Absolutely. So I want to expand on this, but just a little before that, I want to share something about the inspiration, or the motivation, for why this company was built and why this whole journey started. And I think we’ve touched partially on that. I mean, here’s the thing: getting back to the example I gave about geospatial imagery, not having labels, not having good labels, is one problem with real-world data. In fact, there are a lot of other problems or challenges. One, it’s time consuming. Think about the time you spend collecting real-world data. Let’s take an example: say you’re trying to get some data for a self-driving car, or a smart car, to figure out its way and stuff like that. You go out and you start collecting a ton of data, and then you need to process this data, and then you need an army of people to label this data, because you need different kinds of labels, right? You need bounding boxes, you need semantic segmentation, you need depth, you need a lot of things. And while you’re doing the labels, it usually needs a lot of quality assurance, right? So you have people labeling, you have people revising those labels, and then you have the need to understand the label specs: how exactly are you going to label this, et cetera. So labeling is a problem. It’s error prone, with a lot of faults or mistakes; I’ve experienced this. We worked with labeling teams, and, for example, the definition of how we need to label something could vary widely from person to person, not to mention from team to team. Like, when you say, for example, let’s label bicycles: do you have a separate label for the bicycle itself and the rider, or are they all in the same bounding box? That’s a difference; that’s a label spec that you need to align on, right? So you spend a lot of time collecting data, you spend a lot of time labeling data, you usually have a lot of errors in those labels. And sometimes some labels aren’t even feasible to collect in the first place. Let’s say, for example, estimating depth for cars. Any self-driving car needs to be able to accurately estimate depth from images, right? Some of them will use LiDAR to estimate depth, but in addition to LiDAR, you also need to use images. So how are you going to label depth? It’s very hard for a person to actually label depth, but it’s easy to do with synthetic data, because you have total control over that.
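
To make the depth point concrete: a renderer knows the exact 3D geometry of everything it draws, so a perfect per-pixel depth label falls out of rendering itself, with no human annotation pass. A toy illustration, assuming a single flat wall seen by a pinhole camera (all numbers made up):

```python
import numpy as np

# Toy "renderer": one flat wall 10 m ahead, viewed by a pinhole camera.
# A real engine rasterizes complex scenes, but the principle is the same:
# the depth buffer it fills while drawing IS a perfect per-pixel label.
H, W, f = 120, 160, 100.0           # image size and focal length in pixels
u = np.arange(W) - W / 2            # pixel offsets from the image center
v = np.arange(H) - H / 2
uu, vv = np.meshgrid(u, v)

wall_z = 10.0                       # wall distance along the optical axis
# The ray through each pixel hits the wall; its range is
# z * ||(u/f, v/f, 1)||, i.e. exact metric depth per pixel.
depth = wall_z * np.sqrt((uu / f) ** 2 + (vv / f) ** 2 + 1.0)

image = np.full((H, W, 3), 128, dtype=np.uint8)  # the rendered RGB (flat gray)
print(depth.min(), depth.max())     # exact labels, no human annotation
```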

Eric Dodds 18:52
How are you going to label it in all three dimensions?

Omar Maher 18:55
Yeah, exactly. Or how are you going to label, for example, optical flow, which is understanding the motion of objects? It’s very hard; you’ve got to keep track of the different frames of the images and say whether this moved or not. So some labels are virtually impossible to label. And finally, the iteration speed: let’s say you come up with a new idea, you want another label or object. It’s slow. So, getting labels requires a lot of time, it’s error prone, some labels are almost impossible to get in real life, and it’s not so efficient for iteration. This is not to say that we should not be using real-life data; I think it’s very important. But why not complement it with something that can solve all of these problems? Does that make sense? And that’s why Parallel Domain started. So the company started about five years ago. The founder, Kevin McNamara, actually had a background in, you know, graphics and gaming. He built a lot of cool projects at major companies, and he led similar and related programs at Apple, for example, et cetera. And the idea was, hey, let’s build something that can simulate real worlds and use this in different contexts. And it happened that one of the early customers wanted to use these virtual worlds for machine learning purposes, to train better perception models for self-driving cars. And that was actually the moment when the company focused big time on creating these virtual worlds, synthetic data, to empower perception machine learning teams. So what Parallel Domain does is we specialize in building and generating synthetic data to help AI perception teams develop better machine learning models, and by develop I mean both training and testing. So we generate highly realistic, diverse, and labeled synthetic data for cameras, LiDAR, and radar. And we’re mostly focusing on outdoor scenarios, empowering perception teams working with self-driving cars, those systems for smart features in cars, helping with, you know, better parking and stuff like that, delivery drones, outdoor robotics, you name it. We’re working with the top names in the industry: Google, Toyota Research Institute, Continental, Woven Planet, et cetera. And what we do is we work closely with those machine learning teams on improving the performance of their machine learning models by providing synthetic data, especially for the cases where their models aren’t performing well. In many cases, these are the edge cases, or the long tail of scenarios. Think, for example, of helping those cars better detect partially occluded objects. Let’s say you have a kid partially occluded by a bus; you definitely don’t want the car to miss that detection, right? And here’s the question: how many instances of that kid in that situation can you have in real life as training data to better train your model? You can barely have, what, tens of these scenarios. You need millions, or hundreds of thousands at least, right? And that’s where synthetic data excels. So we provide highly realistic labeled data for these edge cases and long-tail scenarios: jaywalking, a vulnerable road user standing on the side partially occluded, parking scenarios, debris detection (we simulate a lot of debris scenarios, for example), et cetera.
And we help customers get that data and train their models, and the result is that most of our customers get better performance detecting these edge cases, or improving on these edge cases. And that happens across the board. And yeah, we’ve been in the business working with these customers for five years. Throughout that journey, we have built a team of about 70-something people, in San Francisco and mostly in Vancouver, Canada, but we are a global company; we have people in Germany and different other parts of the world. So, yeah, we’re pretty excited about that.
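
The “tens of real occurrences versus hundreds of thousands of synthetic variants” idea can be pictured as parameter sampling over a scenario template. The sketch below is purely illustrative; the parameter names are invented and say nothing about Parallel Domain’s actual API:

```python
import random

def sample_occluded_pedestrian_scenario(rng: random.Random) -> dict:
    """Randomize one 'child partially occluded by a bus' scenario."""
    return {
        "map": rng.choice(["urban_sf", "suburban", "highway_onramp"]),
        "time_of_day": rng.choice(["dawn", "noon", "dusk", "night"]),
        "weather": rng.choice(["clear", "rain", "fog"]),
        "occlusion_fraction": rng.uniform(0.3, 0.9),  # how much the bus hides the child
        "pedestrian_speed_mps": rng.uniform(0.5, 2.5),
        "ego_speed_mps": rng.uniform(5.0, 15.0),
    }

rng = random.Random(42)
# One rare real-world event becomes as many labeled variants as you want.
scenarios = [sample_occluded_pedestrian_scenario(rng) for _ in range(100_000)]
```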

Eric Dodds 23:07
Amazing. Well, I have so many questions, but I know Kostas probably has as many as I do, so just a couple more from me. The first thing that sticks out as really interesting: you know, there’s that old phrase, time doesn’t slow down for any man; everyone gets the same number of hours in a day. But it’s really interesting to think about, you know, a partially occluded kid behind a bus. Time doesn’t speed up to train an AI model, either. And that doesn’t happen that often, but it changes a family, a community, a city when it does happen, you know? And that’s really interesting to think about. Let me put on my hat as a parent, and I know you’re a parent, too. If you and I didn’t work in the data industry, AI already seems a little bit mysterious. And then if you told me, well, we’re training this model that makes the car brake, but we’re using a lot of fake data to do it. Maybe “fake” is unfair, right? It’s synthetic data. But if you told me that, and I didn’t work in the data industry, that might be a little worrisome to me. Right? Where it’s like, I don’t quite understand AI, and then I find out that you want to make this vehicle brake in front of my child, but you’re using a bunch of synthetic data, or generated data, to do that. What would you say to that person who doesn’t work in the data space to demystify this process of using synthetic data to make things safer?

Omar Maher 24:55
I think my immediate answer would be: hey, does it really lead to safer self-driving in that case or not? If the answer is yes, then I would personally care less about what led to that, whether it’s synthetic or real. And that’s the real benchmark here. The kinds of tests that these companies are doing are just tremendous; they do a ton of tests on real-world data to measure the accuracy and safety of these cars. For example, most of them run daily regression tests over multiple scenarios, to see if the accuracy of the car detecting different objects, pedestrians, kids, vehicles is increasing or decreasing. And that’s the real benchmark that I think we should focus on. If using synthetic data, or fake data, or whatever it may be, is leading to performance improvement on detection in general, and specifically on edge cases that we know for a fact are very hard to collect in real life, then I will be happy, because the bottom line is, this car is becoming safer, or this robot or this drone is becoming smarter, et cetera. And I think this is one of those moments where the attention should mostly be on the outcome versus the mechanism. There is an adjacent conversation here about deep learning being a black box, for example, right? We don’t understand a ton of what’s happening inside deep neural networks. But in some applications, like computer vision, the end result, that this network can detect with high accuracy faces or objects, or detect possible diseases from, for example, MRI scans, is what matters. So in summary, if it’s really pushing the edge and improving the performance and making driving safer, I think there should be less concern over whether this data is real or fake. And the good thing is that it’s easy to benchmark, it’s easy to test, because we test it on real-life data and see the performance. Yeah.
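
A daily regression test of the kind Omar describes boils down to re-scoring the model on a fixed real-world evaluation suite and comparing per-class accuracy against the previous run. A toy version, with hypothetical numbers:

```python
def regression_check(per_class_acc: dict, baseline: dict, tol: float = 0.005) -> list:
    """Flag classes whose detection accuracy dropped versus the baseline run."""
    return [
        (cls, baseline[cls], acc)
        for cls, acc in per_class_acc.items()
        if acc < baseline[cls] - tol
    ]

# Hypothetical numbers: accuracy on a fixed real-world test suite.
yesterday = {"pedestrian": 0.942, "vehicle": 0.981, "cyclist": 0.903}
today     = {"pedestrian": 0.951, "vehicle": 0.979, "cyclist": 0.871}

for cls, old, new in regression_check(today, yesterday):
    print(f"REGRESSION: {cls} {old:.3f} -> {new:.3f}")  # flags only the cyclist drop
```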

Eric Dodds 26:58
Yeah, for sure. It sounds like that’s how you manage divergence, right? You control it by actually testing. You may speed up the process of learning, but that gives you a faster testing cycle to understand how it happens in the real world. And so really, it’s almost an accelerant. You’re not removing the real-world element; you’re actually just shifting it to the testing component, to understand how it’s going to play out in the real world. Would you say that’s fair?

Omar Maher 27:35
Exactly. And the same is true for training, too. Most of the successful experiments slash work that we do with customers is usually a combination of synthetic and real data. It’s usually much easier to get larger amounts of good data in the synthetic world, so you would train your models, let’s say, on synthetic data for a certain use case, and then you would fine-tune those models with the real data that you’ve already collected. Let’s say, hypothetically, you’re using a million synthetic images, and you’re fine-tuning with, you know, 30,000 to 40,000 real-world images. That usually yields the best performance so far. Sometimes synthetic-only yields great performance, but so far it’s mostly a combination of both. But as you mentioned, at the end of the day, when we test, it’s mostly on real data, because that’s where the cars or the drones or the robots are going to be operating at the end of the day. Sometimes we use synthetic data for testing too, and I can share some of that later, but it’s mostly testing on real data for now.
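
The train-on-synthetic, fine-tune-on-real recipe might look roughly like the following PyTorch sketch. The tiny model and random tensors are stand-ins for a real detector and datasets, and the lower fine-tuning learning rate is a common convention rather than something stated in the episode:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-ins so the sketch runs end to end; swap in a real model and data.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
loss_fn = nn.CrossEntropyLoss()
synthetic_loader = DataLoader(
    TensorDataset(torch.randn(1024, 3, 32, 32), torch.randint(0, 10, (1024,))),
    batch_size=64)
real_loader = DataLoader(
    TensorDataset(torch.randn(128, 3, 32, 32), torch.randint(0, 10, (128,))),
    batch_size=64)

def run_epochs(model, loader, optimizer, epochs):
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss_fn(model(images), labels).backward()
            optimizer.step()

# Stage 1: pretrain on the large synthetic set (think ~1M labeled frames).
run_epochs(model, synthetic_loader,
           torch.optim.AdamW(model.parameters(), lr=1e-4), epochs=5)

# Stage 2: fine-tune on the much smaller real set (think ~30-40k frames),
# with a lower learning rate so real data refines rather than overwrites.
run_epochs(model, real_loader,
           torch.optim.AdamW(model.parameters(), lr=1e-5), epochs=2)
```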

Eric Dodds 28:32
Fascinating. Okay, Kostas, I am enamored here, but I have to stop myself. And I’m actually disappointed that you didn’t stop me before, because I’ve been going for so long. Please jump in.

Kostas Pardalis 28:45
It’s okay. You can continue if you want, it’s fine. First of all, you’re having an amazing conversation; it was a great pleasure for me to listen to all that stuff. But I do have a question based on this conversation, actually. Omar, I’d like to ask you: you mentioned labels many times, right? We take the data, and then the data needs to be labeled so we can go and do the training and all that stuff. Would you say that synthetic data, in the domain you are primarily working with at least, with, I would say, more visual information, like radar data and all that stuff, is the process of starting from the labels and generating the images, for example? Is this an accurate way of defining and describing, let’s say, synthetic data?

Omar Maher 29:46
Defining the labels and working with the labels is definitely a critical piece of the pipeline. It’s not necessarily the starting point, though. The conversation can definitely start from labels, right? In many cases, we hear from our customers: hey, I have a problem with this model that is trying to detect, let’s say, the depth of objects. So we immediately understand that we’re going to need to provide labels for depth, which is easy to do with synthetic data, right? But in reality, when you start generating synthetic data, there is a whole pipeline slash process that you go through, starting from accurately mimicking the location, whether it’s urban, suburban, or highway. At Parallel Domain, for example, we support a lot of maps, a lot of locations. And then there is a piece for procedural generation, where we use procedural generation to generate the buildings and structures and all that, and to generate the agents themselves, the pedestrians, the vehicles. There’s a lot of technology already in place that has been used throughout the previous years to generate this. There is another piece for rendering, where you start visually rendering those pieces of the puzzle, so the buildings, the grass, the objects, the pedestrians, the vehicles. That’s the graphical piece, where you want to make sure that these look as realistic as possible. And on top of those is the label piece: how do you want to label this data? Do you want to go with only one type of label, like 2D bounding boxes around objects, for example? Or do you want to do more, like depth, optical flow, 3D bounding boxes, motion segmentation, things that convey motion like optical flow, or sometimes instance segmentation, where you only want to label the boundaries of the object? So I think labels are pretty critical; they usually come on top, as part of the whole pipeline, which includes location, procedural generation, rendering, and simulation. Simulation, actually, is very important. We have whole teams working on how reality should be simulated, like how those agents would interact in real life, because you can have some pretty images showing pedestrians, et cetera, but if you want sequential data, or realistic images, you want to make sure that the pedestrians in the scenario are behaving as pedestrians do in real life, and that the cars are behaving too; you don’t want to find cars driving on buildings, for example, right? So you need accurate simulations, both for the agents in the world and for the sensors, because we have spent so much time accurately mimicking a lot of the sensors that our customers are using: how do you simulate different types of cameras, different types of LiDAR sensors, different types of radar sensors? And on top of that, there are the labels. So as you can see, there are different pieces working together to generate highly realistic data, because that’s important if you want to use it in conjunction with real-world data.
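
One way to picture this pipeline is as a single declarative request to a generation engine: location, procedural generation, behavior simulation, rendering, sensors, and labels, all in one spec. This config is purely illustrative and is not Parallel Domain’s actual schema:

```python
scene_request = {
    "location": {"map": "san_francisco_downtown", "type": "urban"},
    "procedural_generation": {          # buildings, props, and agents are
        "pedestrian_density": "high",   # placed procedurally, then simulated
        "vehicle_density": "medium",
        "spawn_debris": True,
    },
    "simulation": {"duration_s": 10, "agent_behavior": "realistic"},
    "rendering": {"time_of_day": "dusk", "weather": "light_rain"},
    "sensors": [
        {"type": "camera", "placement": "front_bumper", "model": "pinhole"},
        {"type": "lidar", "placement": "roof", "channels": 64},
    ],
    "labels": ["2d_bbox", "3d_bbox", "semantic_seg", "instance_seg",
               "depth", "optical_flow", "motion_seg"],
}
```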

Kostas Pardalis 32:44
Yeah, that makes total sense. So let me understand this a little bit better. I mean, I assume that if you have a modern car, even if it’s not, let’s say, a fully autonomous car with autopilot or anything like that, it has a couple of different sensors, right? Things that can measure distances; some of them have cameras; some of them come with LiDAR. So when we are talking about generating synthetic data, the first thing someone visualizes is that we recreate a video, right? Or an image that has a scene inside it, like, I don’t know, let’s say a car running a red light or something like that. But it’s interesting what you say, because it’s not just that you are trying to simulate the environment; you’re also trying to simulate the perception that the machine has. Tell us a little bit more about that. Where are the boundaries, and where does the generation actually start? Do you do something like, you create the scene without considering the sensor, and then you also put in the sensor and simulate the output of the sensor? How does this work when we are talking about this type of data?

Omar Maher 34:07
Really good question. I want to answer it, but I want to cover something first that will lead to the answer, which is the domain gap. The domain gap is a pretty common thing that almost every perception engineer who has worked with synthetic data knows about. Think of it as the gap between the real data that the models are going to be operating against at the end of the day and the synthetic data that we could be using. This gap can happen for multiple reasons. It could be on the visual side: the data would look different, and it could look different for multiple reasons. Say the graphics engine behind the synthetic data isn’t of high quality, so it looks like graphics versus real-world data. The lights might be different, the weather conditions might be different, the agent distribution might be different. In real life, for example, in the area where you would like to launch these models, you might have a high density of pedestrians and vehicles, so if you’re training with synthetic data with low densities, it’s not going to be of much help, right? Because you would like to mimic the real world. So I guess what I’m trying to say is that the gap between synthetic and real is a big topic when it comes to synthetic data, and closing that gap, bridging it, shrinking it, is usually a major thing. You try as much as possible to make it minimal, because it does impact model performance. So, in summary, we simulate the world itself, and then we simulate the sensor: where it is placed, what kind of sensor it is, the angle of view, and all of those things; we simulate where it is placed on the vehicle. The output of that is that you have realistically simulated worlds, you have realistically simulated vehicles and placement of that sensor, and we simulate the intrinsics of the sensor itself, the specifics of what goes on inside that sensor. You’re simulating all these elements together, so the output is highly realistic images, or scenes, captured from that very sensor in that very location within that world. So let’s go into some detail. First, you simulate the actual world. This goes back to the pipeline I was describing: first of all, we start with the location, whether it’s urban, suburban, or highway. We have different maps for different locations. For example, a lot of our customers like to test these vehicles in the Bay Area, so we have maps for San Francisco, for all the streets and stuff like that, and we can mix and match some urban and highway scenes there, for example. And then you start adding the different pieces of the pipeline on top: the simulation, to nicely simulate the agents; the rendering, to visually render those agents and structures, et cetera, et cetera. So that’s the world-building piece, if you think about it; you build an actual simulated world. And within those worlds, you have the vehicles. One of those vehicles is the ego vehicle, which is the vehicle that the model is going to be deployed in; think of it as the ego view of that vehicle driving. And then we have different models of vehicles, depending on the dimensions of the vehicle, et cetera. And then one thing our customers can configure is where they would like to place the sensor.
Like, what is the actual sensor placement? Is it at the front, the back, the rear? Where exactly is it located? That’s usually something that we do. And once you do that, we also simulate the intrinsics of the sensor: what model exactly, how it works from the inside, et cetera. And we start simulating the kinds of scenes that would be captured from that sensor. In that case, it will be as realistic as possible relative to the actual data captured by the real sensors in the real world, because we simulated almost everything in that pipeline: the world, the sensor, the placement, the location, the vehicle. And that results in data that looks similar to how real-world data would be captured from that sensor. So for example, sometimes our customers have a fisheye sensor, which makes the image warp, and in some ways that has advantages for machine learning applications: you can capture wider angles and detect more. Others have, you know, normal camera sensors that are not that wide. So we take that into consideration to generate realistic images. It’s a long answer, sorry for that, but...
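
“Simulating the intrinsics” of a camera reduces, in the simplest pinhole case, to modeling how the sensor maps 3D points to pixels through an intrinsic matrix (a fisheye model adds a distortion function on top). A minimal sketch with made-up intrinsics:

```python
import numpy as np

# Hypothetical pinhole intrinsics: focal lengths and principal point, in pixels.
fx, fy, cx, cy = 1000.0, 1000.0, 960.0, 540.0   # e.g., a 1920x1080 camera
K = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]])

def project(points_cam: np.ndarray) -> np.ndarray:
    """Project Nx3 points (camera frame, Z forward, meters) to Nx2 pixels."""
    uvw = points_cam @ K.T          # homogeneous image coordinates
    return uvw[:, :2] / uvw[:, 2:]  # perspective divide: u = fx*X/Z + cx, etc.

# A point 10 m ahead and 1 m to the left lands left of the image center.
print(project(np.array([[-1.0, 0.0, 10.0]])))   # -> [[860. 540.]]
```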

Kostas Pardalis 38:31
No worries, it’s a really fascinating topic, to be honest. I don’t know, I think it’s one of those things that is easy to visualize because it is visual, right? We walk outside and we take for granted that we recognize everything; we can understand depth; we can do all of these things. And okay, we generate all this synthetic data to go and train the models to approximate what a human does, but there’s so much information in these scenes, right? It’s kind of amazing to think how complicated it is, how much information has to be managed there, and how you do that in order to approximate reality at the end. And, related to that, you mentioned that you are, in a way, recreating, let’s say, San Francisco, which is probably one of the best documented cities in the world, in terms of data, for these kinds of scenarios. So this simulation that you are creating, how realistic is it? If I, as a human, were watching a video of it on my laptop, how realistic would it look to me?

Omar Maher 39:58
We can actually play a game: I can show you some images, real or synthetic, from Parallel Domain, and you can judge. My guess is that at least 50% of them you will not be able to tell apart. And this is not to brag or anything, but we have a lot of dedicated teams and people just trying to, you know, perfectly simulate lighting and optics, for example, how things would visually look with different kinds of light, whether it’s natural light or artificial light. There are artists, for example, working on exactly how different agents and animals and cars should look, et cetera. And then there is the behavior of these agents, too. So I think, in summary, most of the images that we generate would look as highly realistic as possible. Sometimes, when you look super closely, especially with pedestrians and such, you might guess that this is similar to a game, and synthetic, not real. But I would say in many cases you would not be able to differentiate between the two. And that is generally the requirement: to be as highly realistic as possible. And what goes into that is behavior simulation, accurate visual rendering, and trying to bring in the impact of things like light and weather, nicely simulating rain and stuff like that.

Kostas Pardalis 41:24
That’s super interesting. So, first of all, there are other domains out there trying to do similar things for different reasons, right? You have computer games, for example; that’s one. Then you have CGI in movies, more on the entertainment side of things. Or VR, right? Okay, even if you could generate such a detailed representation of reality, probably the problem with VR right now is the fidelity of the hardware. But outside of that, what’s the overlap between the things that you are doing and these domains? How many best practices or techniques are coming from there? And how can what you’re doing give back to them, by the way?

Omar Maher 42:27
Yep, that’s a great question. Actually, I want to start by saying that I think about 20%, if not more, of the engineers at Parallel Domain, or the teams in general, come from a gaming background. They’ve been building games and simulation engines, et cetera. And I think that is a huge benefit that we get from these established industries, because they have already been building technologies, whether for simulation, rendering, visualization, graphics, CGI, procedural generation, et cetera, that we can leverage to build these simulated worlds. If you think about it, we’re trying to do something similar to what they’re doing, maybe for a different purpose, which is enhancing machine learning models in our case, versus entertainment. So I think we are standing on the shoulders of giants, because we’re pretty much using all the advancements in that capacity. On the other hand, we do two things. One, we bring in the advancements in content generation using generative AI, which is, as you know, exploding these days. So we are using different generative AI techniques to scale content generation. Instead of manually building and crafting these assets, there is a way that you can, for example, ask generative AI to come up with 100 different variations of road debris that you don’t need to create manually, with the right prompts. Techniques like stable diffusion, for example, and diffusion techniques in general, have enabled a lot of that, and we are using a lot of it. We’re using this to create different pieces of content inside our virtually simulated worlds at a scale that would not be possible with traditional means. So bringing both worlds together, the advancements in the gaming industry and simulation and CGI, et cetera, along with the advancements in AI, and making them work together, in my opinion, leads to the best results. And that’s what we’re seeing internally: we are actually generating different content using generative AI techniques like stable diffusion, and that generated content has resulted in improving a lot of the model performance that we’re seeing. So for some objects, the detection accuracy actually increased after the model was trained with content generated from both the existing engines that we have and generative AI techniques. Regarding your question of whether this could go back to the gaming industry: I think absolutely, yes. Maybe we are not actively doing this as a company today, because our major focus is working with machine learning teams, but I cannot see a reason why the advancements that we’re building would not feed into the gaming industry. I think it can go both ways.
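
Generating “100 different variations of road debris” with an off-the-shelf diffusion model might look like the sketch below. It uses the public diffusers library as a stand-in and says nothing about how Parallel Domain’s internal tooling actually works:

```python
# pip install diffusers transformers accelerate torch
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# One prompt template, many sampled variations: content at a scale that
# would be impractical to hand-model as individual 3D assets.
prompt = "photorealistic road debris on asphalt, scattered tire fragments"
for i in range(100):
    image = pipe(prompt, num_inference_steps=30).images[0]
    image.save(f"debris_{i:03d}.png")
```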

Kostas Pardalis 45:08
That’s awesome. All right, one last question from me, and then I’ll give the microphone back to Eric. So, okay, we talked primarily about synthetic data for, what do you call it, perception, so it’s more visual, let’s say. But data can be many different things, right? We can have textual data, obviously; we can have, let’s say, tabular data; we can have structured, unstructured, blah, blah, blah, so many different types, right? What do you see happening when it comes to bringing this synthetic data technology to other data domains? Do you see opportunities there? Do you see needs there? Tell us a little bit more about how we can generalize what you’re doing to other types of data, too.

Omar Maher 46:02
Absolutely. I think synthetic data is pretty much required across the board, to be honest. Any domain slash vertical where it’s hard to get high quality labeled data to train machine learning models is pretty much going to require it. So let me give you some examples. Earlier in my career, I worked with health data, and getting access to health data on the individual level, for example, is super hard, if not impossible, due to privacy concerns that are very well understood. In that case, I think synthetic tabular data would make perfect sense for health-related scenarios where you’re trying to build machine learning models that detect fraud, waste, and abuse, for example. Fraud, waste, and abuse happens and costs the US alone, if I remember the number correctly, more than $100 billion annually, and there is a lot of demand to apply analytics and machine learning to detect it. And doing that with real-life data is not so feasible, because you have a ton of researchers across the country who can do innovative work, but they might not be able to access this data, for different reasons. Giving them highly realistic synthetic data that mimics the real world would enable them to build these kinds of fraud detection algorithms. The same thing happens with financial data and transactions, right? If you want to build fraud detection systems, if you want to enable researchers, you can’t just give them private personal data for financial transactions. So, same thing. I can think of many other cases with unstructured data, too; it’s not just computer vision. Let’s say speech: if you want to build models that are great at transcribing audio to text, you need a lot of labeled data, and not a lot of companies have that. So having highly realistic, labeled audio data would help with that. I can also think of other domains, like text. I mean, we are seeing the explosion that’s happening with GPT, et cetera, and it’s amazing, because it’s using self-supervised techniques: it’s trying to predict the next word, and the labels are already included, in a way, so it does not necessarily need explicit labels for these tasks. But I think for things like tabular data, financial transactions, health insurance, and the same with audio data, these are all examples of domains that would benefit from synthetic data.

Kostas Pardalis 48:30
That’s awesome. Thank you so much. Eric, it’s all yours again.

Eric Dodds 48:34
All right. Well, we are at the buzzer here, but Omar, one more question for you, and I hope you’re okay with this. I want to just dip our toe into the water of the ethical question, you know, as it relates to AI. And I think you deal with a very interesting component of this, in that you have a lot of experience with labeling. Traditionally, labeling has been handled by, you know, a large human workforce, but companies like Parallel Domain can create synthetic data and automatically label it, right? And there are lots of different opinions about this. You know, we’re not a show that has expertise in economics or anthropology or politics, but there may be people who used to label a bunch of data, and now Parallel Domain is doing that. How do you see that playing out? Because there are a lot of people who think this will create new career opportunities, and some people are worried. Can you help our listeners, and us, think through the impact of that? Because you’re really kind of on the bleeding edge of this and have probably already seen some of it within the realm of labeling, even.

Omar Maher 49:53
Absolutely. I think, in short, we’re still definitely going to require human labelers to help with improving machine learning models and making them safer. It’s just that the nature of the work could differ a bit. So instead of spending a ton of time, for example, drawing bounding boxes around objects, or doing sophisticated semantic or instance masks on objects, which takes a lot of time, these efforts are going to be redirected to, let’s say, higher-level forms of quality assurance. For example, when you start training those models and running them in the real world, you see different degrees of performance, right? So having someone to understand where these detections are missed, and where we need to improve those models, is still going to require some level of supervision and quality work. That’s one thing. Another thing is providing some sort of quality assurance and monitoring for the synthetic data itself. Even with highly stable synthetic data engines, when you generate the data, sometimes there are problems, sometimes there are things that you don’t want to be there, or that you would like to modify. This is still going to require some level of supervision. So, you see where I’m going: I think we still have a lot of tasks, a lot of need to employ this expertise, maybe in a place that will lead to better results, or faster results, or safer autonomy. And this is similar to any kind of technological advancement: you start to have some jobs shifting to different shapes and structures. So that’s one thing. The other thing is, I think we’re still going to require real-world labels anyway. I don’t see, in the near future at least, synthetic data totally replacing real-world data, like 100%. I think we’re still going to need some level of real data, whether that’s 10%, 20%, 30%; we can debate that, or, even better, we can find out. So I think we’re still going to require it. So yeah, that’s my take.

Eric Dodds 52:05
I love it. Well, I mean, the real world changes; we don’t live in a static world. And so managing the dynamic nature of reality, I agree, certainly requires a human interface. Well, Omar, this has been absolutely fascinating. The time flew by. We have many more questions, so come back and join us. And thank you for the time that you’ve already given us.

Omar Maher 52:36
Thank you so much for hosting me. I really enjoyed the discussion. And thank you for reminding me of the beautiful moments of my early data career, actually, the starting point.

Eric Dodds 52:46
Thank you for sharing those with us. That was special.

Omar Maher 52:49
Absolutely. Thanks, Eric. Thanks, Kostas. Nice to meet you.

Eric Dodds 52:53
Well, Kostas, what a conversation with Omar from Parallel Domain. I mean, it’s almost impossible not to have fun if you’re talking about synthetic data to train AI models that are driving learning for self-driving cars and other imagery use cases. I mean, just unbelievable. So I loved it. I think one of the biggest things that stuck out to me was, you know, when you scan Hacker News, when you see news headlines, especially related to ChatGPT and all these other AI technologies, they tend to be extreme, right? AI is taking away jobs, or we need to stop AI, we need to slow down, we have all these people who signed a letter. Omar was so balanced. It almost felt personal to him, like he felt he could stop a self-driving vehicle from hitting someone in the street because he can provide better data. It felt very personal, it felt very practical. There was no mystery. And he had a level of confidence that, to me, really invalidated a lot of the headlines that we see. I mean, he’s really doing work on the front lines with companies who are building cars that are trying to drive themselves. And, yeah, that was encouraging to me. I think that’s my big takeaway: he’s balanced. He’s brilliant, obviously. But he’s very confident that this is a way to make things safer, even using synthetic data. We shouldn’t be scared; we actually should be excited. That was a big takeaway for me.

Kostas Pardalis 54:53
Yeah, 100%. I think there is a big difference between what Omar is doing and what is happening with things like ChatGPT. In Omar’s case, the use case is very explicit, right? We understand exactly what the output of improving these models is: it’s safety. We’re not going to have so many accidents; we’re not going to be worrying about whether a child is going to be hit by a car, and all that stuff. And, by the way, it’s not just self-driving cars. As he said, many of these models are used today as part of other sensors in cars, the regular cars that we all drive, to make sure that if an accident is about to happen, the car can help you react faster and stuff like that. So it’s not just the extreme of having a fully autonomous car. And on the other hand, you have GPT, which is a very impressive and very accessible technology; everyone can go and type on this thing. You can’t easily go get a fully autonomous car and experience it like that, right? So people see something big, something great there, but they don’t necessarily understand what the impact is going to be. We’re still trying to figure this out, and that generates fear. So I think that’s one of the reasons you see such a divergent, let’s say, reaction between the different technologies. And I’m on the same side; I agree with Omar that in the end we’re going to see something similar happening with all the GPT kinds of technologies. That’s one thing. The other thing that I’m going to keep from this conversation is that there’s so much overlap between the state of the art in Omar’s domain and things like virtual reality, CGI, games, all these things, and I can’t wait to see how those industries are going to use what’s happening in synthetic data and generative AI today to create even more innovation. So at least we are going to have fun.

Eric Dodds 57:20
At the minimum, we’re gonna have fun. Yeah. And I think people like Omar are the right people to be working on this, because of their disposition and value system. So thank you for joining us again on The Data Stack Show. Many great episodes are coming up, and if you liked this one, just wait, we have so many more good ones coming out soon. If you haven’t subscribed, subscribe on Spotify, Apple Podcasts, whatever your favorite network is. And tell a friend if it’s valuable to you, and we will catch you on the next one. We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.