This week on The Data Stack Show, Eric and Kostas chat with Sammy Sidhu, Co-Founder and CEO of Eventual. During the episode, Sammy discusses his vast experience building self-driving technology at companies that were acquired by Tesla and Toyota. The conversation covers data tooling, deep learning, the state of self-driving technology and its adoption, Sammy’s journey to founding Eventual, and more.
Highlights from this week’s conversation include:
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Eric Dodds 00:03
Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. Kostas, the people we get to talk to continually amaze me. We’re going to talk with Sammy, who’s building Eventual. He was at DeepScale, building self-driving AI technology that was acquired by Tesla, then at Lyft, where he built out the Level 5 team working on self-driving, and that team was acquired by Toyota. Now he’s building tooling to help people who do those sorts of things. He has so much experience shipping large-scale projects and models around complex data. And what I want to ask him about is this: he’s had a dedicated focus on a very similar type of problem over a pretty long period of time. We talk a lot about the recent history of data technology, but if you think back to, say, 2015, when he started at DeepScale, there were still a lot of limitations in terms of running models at scale, data storage, and all that other sort of stuff. So I can’t wait to hear the trends he thinks are most important from his perspective, and then what he’s building at Eventual based on all that experience. What about you? Yeah,
Kostas Pardalis 01:51
I have plenty to talk about with him. First of all, we have to talk about data frames, about pandas, the Python ecosystem around data. And I’d like to hear more about Eventual itself: what it is, what the vision is, and why they built this thing. So I think we are going to have a very interesting conversation, especially because we have a person here who has done all these things around training models and building models for self-driving cars and all that stuff, and who today is starting from scratch on something that has to do with developer tooling and data warehouses in ML. I think that says a lot about the current state of the tooling and technology that people have in ML, and I want to hear it from him. I’m very interested to see what’s going on and why he made this decision. Right. So let’s do that. Let’s go talk to him.
Eric Dodds 02:55
All right. Well, let’s dig in. Let’s do it. Sammy, welcome to The Data Stack Show. So much to talk about, so thank you for joining us. Glad to be here. Okay, you have an amazing story, and you seem to have this knack for getting acquired by large automotive corporations, which is really fun, two times in a row. But take us back to the beginning. Can you just give us an overview of your journey in data?
Sammy Sidhu 03:27
Yeah, for sure. So again, I’m Sammy. It all started when I went to Berkeley, where I focused on high performance computing, aka making things run fast, and deep learning. This was the era when deep learning was taking off and neural networks were starting to seep into things, and I found a research lab that focused on putting the two together. So I worked on everything from making models train really quickly on large supercomputers to making small neural networks. And my professor and PI, who I worked with, had the idea of starting a company. So he started a company called DeepScale, which I joined as one of the early engineers. During that process we went from three people to nearly 40, and I was the CTO towards the end. There I worked on everything from building deep learning compilers and training novel research models to building entire data engineering stacks to process things like point clouds, images, radar, you name it. Towards the end of that we actually got acquired by Tesla, the Autopilot team, where the majority of my team got absorbed into Autopilot working on training models or building infrastructure, taking a lot of the learnings we had developed at DeepScale. After that whole ordeal I went to Lyft, where I continued working on self-driving. They were a little bit earlier, so I brought everything I’d learned from before, trained better models for perception, built entire data engineering pipelines, and really refined the process of how you actually ship self-driving. I did that for about three years, and my team was acqui-hired once again, this time by Toyota. After that I was like, hey, you know what, I’ve picked up a lot, and I’m really kind of tired of the tooling I’ve had to deal with: using systems that were designed for tabular data, things like Spark and BigQuery, and adapting them to make them work with images and point clouds and whatnot.
So I was like, hey, things can be a lot better. And my co-founder and I, who I had met at Lyft, decided to start Eventual to build a data warehouse for everything else.
Eric Dodds 05:31
Love it, and so much to ask about Eventual. One thing I’d like to start with, because I think it’d be really helpful for a lot of our listeners, and I’m just plain curious: you went from being a multipurpose engineer, obviously with an emphasis on infrastructure that’s feeding heavy duty data science stuff, very practical, wearing lots of hats at an early stage company, and then you became CTO. What was that transition like? What were the things that really stuck out to you about becoming CTO, as opposed to executing a lot of practical stuff day to day to get the software to work?
Sammy Sidhu 06:21
Yeah, it’s an interesting shift. When you’re an engineer, the kind of fulfillment you get is, what did I accomplish this week or today, or what features did I ship. Then when you transition into being a manager or a tech lead, it’s, what features or what product impact did my team ship, so you have to change your mindset a little bit to get more fulfillment and happiness from that. And then when you’re CTO, it’s even one degree removed from that, which is, what is my company shipping? How are we making our customers happy? And the thing that you have to measure is, are the decisions I’m making now, which will have an impact a year from now, going to pay off? So you kind of have to modify the discount factor in your head at every step of the way. Yeah, yeah, that’s
Eric Dodds 07:07
super tough. Was it a steep learning curve, thinking about that discount factor and having to think much further ahead than you ever had in more practical engineering roles?
Sammy Sidhu 07:22
I think so, because in the beginning, when you transition, you still try to do a little bit of your old job while doing some of the new responsibilities, and what you end up doing is a bad job at both. So you kind of have to learn: okay, I have a team to handle the things I used to do; I need to now focus and get better at the things that are more important for me.
Eric Dodds 07:42
Yep. Yeah, that makes total sense. Thank you for indulging me on that. Engineer to manager to CTO, as you said, those are just very different. I’d love to get your perspective on the problem space that you operated in, and are still operating in, over a pretty long period of time. You were building a lot of this stuff back at DeepScale, and then you saw that through to Lyft, and then of course Toyota, and now you’re building your own thing. Over that span of time, a number of things have changed, right? I mean, the amount of data available to us, or the amount of non-tabular, complex data, seems to me to have grown a significant amount in that time. You actually even have deep learning generating a bunch of that data now, which is really crazy to think about; it’s not just capturing images of the real world. But the technology has changed a lot as well, right? Infrastructure, multiple new technologies in the data science space have come out. Other things haven’t changed. As you look back over your experience, and maybe even tie that into the things that led you to founding Eventual, what are the main changes you’ve seen over that time period since you started at DeepScale?
Sammy Sidhu 09:18
Yeah, it’s really interesting. I think there have been step functions along the way. Initially, when you’re talking about 2015, 2016, the model you could ship, the model that you built for perception or whatever task you were doing, was kind of limited by how long it would take to train on a single machine. You’d have a server racked up somewhere in your office, and you typically wouldn’t want to wait more than three to four days to train on it. That set the limit of how much data you could train on and how big your model could be, and your iteration time would be limited by that, so you had to be very careful with what you trained and the data that went into it. The next step that happened is that distributed training became very ubiquitous: we had frameworks that made it a lot easier, we had people with more expertise with it, and the volume of data was now proportional to how many GPUs you could train on in parallel. That gave us a huge explosion in the amount of data we could train on, so we went from tens of thousands or a hundred thousand images to millions of images that we’d train on at any given time. The second thing was that our iteration time went from something like four or five days to maybe a day, or sometimes even hours, so you could crank out a lot more iterations of your model. And in self-driving, the thing I believe is that the team that gets the most iterations essentially wins, because every time you can do something, get feedback, and improve on it, that’s the fastest path you can take to improve your overall system. So that was the next step. And then we kind of hit a point where the models weren’t really advancing anymore.
Or another way to put that is, the models were no longer the bottleneck. Before, there was a point where if I just changed my model to the latest and greatest paper, I would get a jump in performance. But we hit a point where that didn’t make a difference anymore, and the thing that was really crucial was, what is my data quality? How can I improve my dataset? Are things badly annotated? Do I have examples with conflicting truths? That’s where the data game became really important. So instead of tweaking or changing the model, we would just dive into a dataset, find the failures, figure out what to do with them, and then iterate on the datasets. And these data systems became
Eric Dodds 11:51
very important. Super interesting. And one thing you mentioned that I’m interested in digging into a little bit more: back when you had these, let’s call them physical constraints, you mentioned that you had to be really careful, which makes a ton of sense, because you’re trying to maximize every hour that this model is running, and you’re still trying to create a fast cycle. In a world where you can run this thing on so many GPUs simultaneously and you don’t have that physical constraint, do you get the license to be less careful? And are there consequences for that? Or, I mean, maybe that was helpful in some ways.
Sammy Sidhu 12:45
It kind of was. The analogy I draw in my head is, if you think about programmers way back in the day, if you wrote code and ran it through the compiler and it took forever, you had to be very careful what you decided to compile, right? Or if you were putting things onto a punch card, you had to be very careful what you were doing. But nowadays we just write code and smash compile, and we get an error warning right away. Yep. So it’s kind of the same question of, were programmers back in the day better? And my answer is, I think programmers today get to focus on higher level concepts rather than doing the job of the compiler. I view it the same way: having a lot more GPUs and a lot more compute is simply trading off human time for computer
Eric Dodds 13:26
time. Yep. Yeah, that makes total sense. That makes total sense. That’s such a good analogy. And one thing I’d love to bring up: we actually had a guest on the show previously who had also worked in the self-driving space, which is really interesting. This was a while ago, I think, Kostas, maybe a year and a half ago: Peter from Aquarium, I think. Yep. Oh, no way, you worked in the same lab? Okay, that’s awesome, I wasn’t aware of that connection. Okay, this is great, so we’ll get an updated take from you on Peter’s take from a while ago. There’s been an immense amount of work that you and Peter have both personally invested into the self-driving space. But for the common person out there, the news headlines tend to run ahead of the practical experience of self-driving. So could you describe the state of self-driving from the perspective of someone who’s built literally core technology that’s enabling this to become a reality, in terms of mass adoption or implementation? Yeah,
Sammy Sidhu 14:44
It is interesting. It also depends on what part of the landscape you’re looking at. On one side you have what I call bottom-up self-driving. If you bought a car five years ago, it would have had AEB, which is automatic emergency braking: if you’re closing in too fast on something, the car brakes automatically. That was kind of the bare-bones safety feature you would have in a car. But now you have things like adaptive cruise control and automated lane keeping, and it’s going higher and higher every year. Then you have the other front, which is called top-down, where you have things like robotaxis with no steering wheels, and eventually that technology will trickle down to the everyday driver. They’re both making progress. But if you think about the bottom-up approach, the progress can be gradual: you can have a car every year where the features get better. I think Tesla has shown this quite a bit, where every year it works in more places and it’s a little bit better, and you have companies like Mercedes and Audi also putting out these features. But for the Level 5 thing, where I get into a taxi with no driver and no steering wheel, that’s a binary threshold: it either works or it doesn’t. Yeah. I do imagine one day we’ll pass that threshold, but for now the bottom-up approach, the stuff I did at DeepScale and Tesla, is, I feel, making the most progress.
Eric Dodds 16:11
Yep. And the technologies that drive both of those: what’s the relationship between the underlying work? Is the baseline work feeding both of those efforts, or are they approached pretty differently?
Sammy Sidhu 16:24
They’re completely different. If you’re just tackling highways, it’s actually a very simple system, and we can build systems that can drive on 90-plus percent of the highways in the US without too much difficulty. Wow. Yeah. Wow. So it’s not too bad, and we are seeing cars that are starting to do that. Yeah. However, when you start thinking, oh, I want to do the off-ramps and on-ramps and the cities, that’s when things get very hard. I think self-driving is the most insane case of the long tail I’ve ever seen, where the last 10% of the work is just ungodly. Yeah.
Eric Dodds 17:01
How do you deal with that? I mean, certainly part of that last 10% of the work is dealing with the world actually changing, right? Maps have actually gotten incredibly good at incorporating user feedback, even in real time, which is amazing. You’ve seen this happen over the last couple of years, where Google Maps and Apple Maps will prompt you and say, is the wreck still there, or is the construction still there? Those feedback loops are amazing, so the maps are getting better. Because traffic patterns change, construction happens, and incorporating user feedback into navigation instructions lets them quickly recalculate which street you need to go on. It’s related to that, but when you have a car with no steering wheel, do you approach that differently? I mean, they’re still using map technology, but it’s pretty different, where it would seem
Sammy Sidhu 17:59
different. Yeah, so the mapping technology is interesting for that, but so is just general perception, trying to understand what’s going on around the car. It’s quite interesting, because for the case of not having a steering wheel in the car, you need to get to a critical point where you can adapt to change in the world as fast as the change is happening in the world. I’ll give an example. When I was working in self-driving, one thing that came out of nowhere was this: we had mathematical models to represent the motion of pedestrians in the street, how a person would walk on the street, or whether they might be on a bike or a motorcycle. But then suddenly SF had thousands of electric scooters. Oh, yeah. People would now be on the sidewalk, then the road. Crazy. Yeah, yeah, it was wild. You’d see all these random scooters in the street, and now your whole set of priors from before is useless. Yep. Right. So you have to adapt to that change rapidly. Yep. And that’s how it is in self-driving, and why it’s so hard: the world isn’t static, it changes, and you need to be able to keep your data loops, your model loops, your ability to ship, as fast as the world changes.
Eric Dodds 19:13
Yep. Yeah, that makes total sense. One really specific question: did you work on distributing models, like distributing updated models to the actual fleets themselves? Because you have this challenge of getting data to update the model, but then you actually have to redistribute it to the fleet. Is that problematic, or is it actually pretty streamlined? No, I would say it’s pretty streamlined.
Sammy Sidhu 19:40
The hard part is everything from the point where you have a trained model to being able to say, I can safely deploy this. You have to do simulation: you simulate this model and the whole stack on tens of thousands of GPUs, over all your historical data, and even do things like hardware simulation, and then finally do a small rollout and then the full rollout. It’s usually a lot of work, and it’s why self-driving is so ops intensive.
Eric Dodds 20:09
Yep. Yes. Okay. So no listener can complain about QA again.
Sammy Sidhu 20:17
I would say a lot of the stuff that’s out there at those companies is QA.
Eric Dodds 20:22
Yeah. Yeah, that’s wild. Okay, I could keep going, but one more question for me, and then I’m going to hand the mic off to Kostas. Do you own a car? And if so, what kind of car do you have? Because you’ve been acqui-hired by multiple car companies, so I just need to know this.
Sammy Sidhu 20:39
So I’m a horrible person to ask this, because I love driving. I’m actually a car guy, so I drive a 1990s BMW that I work on a lot. And then for my daily commute,
Eric Dodds 20:53
I have a Toyota RAV4. Yeah, totally. That’s great. I have an old 1985 Land Cruiser that I work on, so I’m the same way. It’s pretty low tech, but super fun.
Sammy Sidhu 21:06
I mean, the stick shift. I
Eric Dodds 21:08
I enjoy driving it. Yeah. Love it. All. Right, Costas.
Kostas Pardalis 21:12
Thank you, Eric. All right. So you’ve founded something new that’s called Eventual. Tell us a little bit more about, first of all, what Eventual is, and then I’d like to ask what made you start working
Sammy Sidhu 21:31
on that? Yeah, the way I would sum up Eventual is that we’re building the data warehouse for everything else. You can think of things like BigQuery, or Presto, or Athena: these are amazing for tabular data, anything that fits in an Excel spreadsheet. But when you have thousands or millions of images, or video, or 3D scans, they don’t really work well, and what we’re doing is building something native for that. To do that, we’re actually building an open source query engine to help query this type of data. The way I like to explain why we need a new query engine is to think about what the natural user interface is for the data. If you ask most people, hey, what’s the most natural interface to tabular data, they will tell you SQL, and they will agree that SQL makes a lot of sense for tabular data. But if I’m talking about images and video, do I really want to use SQL to query video or images or arbitrary complex data? It doesn’t really make a lot of sense. What does make sense is meeting the ecosystem where it is: if you do any type of machine learning or complex data processing, you’re probably using Python, and you’re probably using tools like PyTorch and TensorFlow and various image parsing libraries. So we’re building a data frame library that’s distributed, utilizing a Ray cluster underneath, but native to the Python ecosystem: you can use your normal Python functions, your normal Python objects. Under the hood we have a really powerful vectorized compute engine written in Rust, and also a powerful query engine and query optimizer and all the special things that you would want in a data frame.
Kostas Pardalis 23:11
All right, so the existing tools didn’t work for, let’s say, images and data types that are outside of the typical relational model that people have experience with so far. So if this thing does not exist today, how were you doing the work that you were doing all these years? What was the state of the tooling there, and how good of an experience was it?
Sammy Sidhu 23:44
Yeah, so for companies I’ve worked at, or companies I’ve helped out, the first step of the process, which usually does not change, is the equivalent of having a bunch of files in a folder on your desktop: most people have a bunch of individual images or videos just sitting in S3 or a Google Storage bucket. In the beginning that’s completely fine, and they use that directly to train. But then they’re like, hey, you know what, I want versioning, or I want to keep track of some metadata. So they end up building a system where you have the individual files in something like S3, but you use something like BigQuery or Postgres to store all the metadata. Then it starts evolving from there: okay, now I need an easy way to access the data, so you typically build an abstraction on top of it using a workflow engine. Now you have something where the tabular data is here, the complex data is there, you have a pointer that points to it, and then you have something like Airflow or Spark to process it. What winds up happening is that this spaghetti gets piled on more and more, until eventually you have three teams managing just one system that’s very limited. What we’re trying to do is build an engine that bridges the two: you can process all the tabular data you need and write really expressive queries, but then also have things like image columns and video columns, and you can interact with both in one place. The next step for the data warehouse is how you actually store the data together, such that things don’t go stale relative to one another, things are versioned together, and the schema is tied together as well.
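The pattern Sammy describes (raw files in a bucket, metadata in a relational store, and a pointer column joining the two) can be sketched in miniature, with SQLite standing in for Postgres or BigQuery. The table name, columns, and S3 URIs below are illustrative assumptions, not any specific company's schema:

```python
import sqlite3

# Miniature version of the "files in a bucket, metadata in a database" pattern:
# each row holds a pointer (URI) to a blob in object storage plus queryable metadata.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE image_metadata (
        id INTEGER PRIMARY KEY,
        uri TEXT,          -- pointer to the actual bytes, e.g. in S3
        label TEXT,
        width INTEGER,
        height INTEGER
    )
""")
rows = [
    (1, "s3://my-bucket/imgs/0001.jpg", "pedestrian", 1920, 1080),
    (2, "s3://my-bucket/imgs/0002.jpg", "scooter", 1920, 1080),
    (3, "s3://my-bucket/imgs/0003.jpg", "pedestrian", 1280, 720),
]
conn.executemany("INSERT INTO image_metadata VALUES (?, ?, ?, ?, ?)", rows)

# The metadata is queryable with SQL, but the engine knows nothing about the
# image contents; a separate system (Airflow, Spark, ...) must fetch each URI.
uris = [r[0] for r in conn.execute(
    "SELECT uri FROM image_metadata WHERE label = 'pedestrian' ORDER BY id")]
print(uris)  # two pedestrian URIs; loading the pixels is some other system's job
```

The "spaghetti" he mentions grows out of exactly this split: the SQL side can filter on labels and dimensions, but every pipeline that needs the actual pixels must bolt on its own fetch-and-process layer.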
Kostas Pardalis 25:17
So what made you leave the work that you were doing, right, like training the models, which you liked a lot, and turn to tooling? Because my feeling, from what I hear from you, is that, yeah, okay, foundational work is getting done, and it’s amazing, but we have probably reached a point right now where things slow down because the tooling is not there yet. And I think anyone who has been an engineer for some time, regardless of what kind of engineer, a backend engineer, a front end engineer, whatever, can see that not every field has access to the same type of tooling. Look at what is happening with front end development, for example, and the tooling that is out there. I know that people keep complaining all the time about all the JavaScript libraries and all these things, but in the end that’s also an indication of growth and progress in the tooling. In my feeling, at least, if you compare the tooling that a data engineer or an ML engineer has to the tooling that a front end engineer has, there’s no comparison; it’s a very different experience in terms of how much the tools help you do your job. So how big of a problem do you think this is? I mean, okay, I guess the answer is easy, because you moved on and you’re building a company around it. But I’d love for you to communicate to the people out there the complexity, and how much of a problem it is to not have the right tools to do your job.
Sammy Sidhu 27:00
Yeah, that’s a good question. I would put it this way: for the problems you mentioned on the front end, and I believe also for tabular data, there’s kind of a graduation path, if you will. If you’re starting off with a data science project, or doing some data engineering, and it’s just a small set of data, you can use pandas and you’ll be completely fine. And once you need to graduate, like, hey, I have more data now, it’s taking too long, I can’t fit it on one machine anymore, you have tools to go to: your data warehouses, a bunch of different tools. However, in the domain we’re tackling, there isn’t really a path besides “I build custom infrastructure for my problem.” So what’s happening is that that barrier can slow your progress quite a bit. What we see for a lot of startups and people just doing projects is that the dataset size they can process and comb through is completely limited to whatever they can process on one machine. And if you think about the implication of that, complex data is typically a lot larger per record, so if you’re processing video or images, one machine’s worth is actually not very much data at all. So what we’re trying to do is build a tool, you know, Daft, to give people a path to graduate: they can start off processing data this way, and when they need to scale up, it will scale with them. And then finally, when there’s a larger company around it, we give them all the benefits of a data warehouse that you typically find in something like Snowflake or BigQuery. Yeah,
Kostas Pardalis 28:30
yeah. So okay, so what’s Daft? Yeah, that’s
Sammy Sidhu 28:34
I’m glad you asked. So yeah, Daft is a Python data frame that’s distributed and essentially made for complex data. A data frame, if you’re familiar with pandas or Polars, is essentially a way to represent a dataset as a set of columns. It’s very similar to querying a SQL database, where you essentially get back a set of columns and rows. The thing with Daft that’s a little bit different is that a column can be something you’re used to, like an integer or a float or a string, but it can also be an image or a video. You can do operations like: okay, I’m loading these MRI scans into my data frame, I have things like the patient name, the patient ID, whatever, and then I can do something like an explode on the MRI volume and actually get individual slices. I can then natively run a model on those slices and determine, is there a cancer here, or, what’s the difference between these different brains? You can run all these operations just using the normal tools you want to use in Python, which gives the machine learning engineer or data scientist a lot of power in a very simple way.
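The MRI workflow he walks through (a column holding a volumetric scan, an explode into per-slice rows, then a model over each slice) can be sketched in plain Python. Everything here, the row layout, the explode helper, and the toy "model", is an illustrative stand-in, not Daft's actual API:

```python
# Pure-Python sketch of a data frame whose column holds complex data
# (a 3-slice "MRI scan" as nested lists) plus an explode operation that
# produces one row per slice. All names are illustrative, not Daft's API.
rows = [
    {"patient_id": "p1", "mri": [[[0, 1], [1, 0]],    # slice 0
                                 [[1, 1], [1, 0]],    # slice 1
                                 [[0, 0], [0, 1]]]},  # slice 2
]

def explode(rows, column):
    """One output row per element of `column`; other fields are repeated."""
    out = []
    for row in rows:
        for i, item in enumerate(row[column]):
            new = {k: v for k, v in row.items() if k != column}
            new[column] = item
            new["slice_index"] = i
            out.append(new)
    return out

def toy_model(slice_2d):
    # Stand-in for "run a model on each slice": flag slices with >= 2 ones.
    return sum(sum(r) for r in slice_2d) >= 2

slices = explode(rows, "mri")
flagged = [s["slice_index"] for s in slices if toy_model(s["mri"])]
print(flagged)  # → [0, 1]
```

The point of the sketch is the shape of the operation: explode turns one row with a complex column into many rows, each of which an ordinary Python function (or a PyTorch model) can consume directly.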
Kostas Pardalis 29:44
How is this different from taking, let’s say, a standard data frame, or even pandas, and using a data type that can hold binary data, like a byte array or something like that? Because one of the things I have noticed is that all the stuff we are talking about here is primarily binary data, right? So what is the difference? What is needed on top of that to make the experience of working with this data better?
Sammy Sidhu 30:14
Yeah. So I think the biggest thing there is how you actually represent it in a way that’s efficient, so the actual operations you do are fast. The second thing is what the user interface looks like. And the third thing is how you actually make it scalable. With pandas, you can put things like NumPy arrays or images inside the data frame as an object, but the problem is that distributing it over a cluster, or natively using the tools you’re used to, like PyTorch or TensorFlow, is difficult. So what we do is represent it as an image column and do things very intelligently under the hood. For example, if you have something that has an image and you want to send it across the cluster, we can do things like keeping it in the format that’s most efficient, if it was a JPEG, for example. Another thing that’s pretty interesting is that the distributed tools you do have are usually not as powerful. I think the biggest tool that most self-driving companies use is Spark, and if you’re using the Spark data frame API, one of the key things is that you’re kind of limited in type support. You can do things like integers and strings, and they do have a way to just turn whatever object you’re dealing with into a bunch of bytes. But the problem is that that’s not very pleasant for the user: I, as a user, have to constantly convert back and forth from whatever I’m doing to bytes, and whenever I’m trying to read it back, I have to convert the bytes back to the thing I was working with. And that’s not very nice.
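The round trip he complains about, where an engine whose richest column type is "bytes" forces the user to serialize and deserialize by hand, looks roughly like this. Pickle stands in for whatever serializer a team would pick, and the record layout is illustrative:

```python
import pickle

# When an engine's richest type is "bytes", the user carries the burden of
# serializing on every write and deserializing on every read.
frame = {"camera": "front", "pixels": [[0, 255], [255, 0]]}  # illustrative record

# Write path: the engine only accepts bytes, so we encode by hand...
blob = pickle.dumps(frame)
assert isinstance(blob, bytes)

# ...and the read path must undo it by hand before any real work happens.
restored = pickle.loads(blob)
print(restored["pixels"][0])  # the structure survives, but only via boilerplate
```

Beyond the boilerplate, the deeper cost is that the engine sees only an opaque blob: it cannot vectorize operations over the pixels, pick an efficient wire format, or let the optimizer reason about the column's contents.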
Kostas Pardalis 31:47
Yeah, and I would also assume that the query optimizers for these systems are probably only built for tabular types of data, right? They cannot really use any information from a binary array or something like that to optimize the query itself. So do you see what you want to build as adding this kind of functionality—being more semantically aware of what is stored there?
Sammy Sidhu 32:14
Yeah, 100%. So we have a very powerful query optimizer that can handle these use cases. One of the things that happens with Spark, for example, is that when people get frustrated with the DataFrame API—where you can only use bytes—they drop down to the very low-level API, which is called an RDD. That's just a big collection of rows of whatever, and you lose the query optimizer, you lose out-of-core processing, you lose all the benefits of a Spark DataFrame. For us, we can combine the best of both: we have a DataFrame API that's very intuitive, you can use GPUs very efficiently, but we can also optimize quite a bit when you write your queries. Whenever you make calls in Daft, it's very lazy. The calls stack up into a query plan, and we find the most efficient way to actually run it on your cluster.
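The lazy-evaluation idea Sammy describes—calls stacking up a plan that only executes at the end—can be sketched with a toy class. This mirrors the concept only; it is not Daft's actual internals, and the method names are made up:

```python
# Toy sketch of lazy query building: calls append to a logical plan
# instead of executing; nothing runs until .collect().
class LazyFrame:
    def __init__(self, rows, plan=None):
        self.rows = rows
        self.plan = plan or []  # stacked-up logical operations

    def where(self, pred):
        return LazyFrame(self.rows, self.plan + [("filter", pred)])

    def select(self, *cols):
        return LazyFrame(self.rows, self.plan + [("project", cols)])

    def collect(self):
        # Only here does work happen, once the full plan is known —
        # which is what gives an optimizer room to rearrange steps.
        out = self.rows
        for op, arg in self.plan:
            if op == "filter":
                out = [r for r in out if arg(r)]
            elif op == "project":
                out = [{c: r[c] for c in arg} for r in out]
        return out

df = LazyFrame([{"id": 1, "size": 10}, {"id": 2, "size": 99}])
q = df.where(lambda r: r["size"] > 50).select("id")  # no work yet
print(q.collect())  # [{'id': 2}]
```

Because the whole plan is visible before execution, an engine can reorder, prune, or fuse steps—something an eager API that executes each call immediately cannot do.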
Kostas Pardalis 33:08
And I think that defeats, like, the whole purpose of having something in Python, right? Which is how easy it is to work with. When you go down to the RDD level, you pretty much have to read the publications from the folks back in Berkeley, from when they came up with it, to figure out how to work with these things. When you reach that point, it feels almost like, okay, I'm writing Java, but now I have to go through JNI or something like that to interoperate with C. That's not what you want to do here, right? Sure, if you want to optimize, do it, but this shouldn't be how everything happens. That's just a bad experience. A question: you keep talking about images, right? But I would assume there are also other types of data there. I don't know—a radar is probably not generating images, it generates something else. You can have audio, right? Is there support for these, or are images the only ones right now?
Sammy Sidhu 34:24
They are supported. So we can support audio and these different types of modalities that essentially wouldn't fit in a regular tool. We've had some pretty interesting use cases. We had a user who was dumping protobufs from a Kafka stream into S3, and they said, hey, I want to just query a bunch of these protobufs without having to ingest everything. So what they could do with Daft is just say: read all these S3 files, deserialize them using my proto schema, and find the ones that have these fields. Rather than ingesting everything into BigQuery or some big heavyweight data warehouse, you could just spin up Daft, and it's on the order of four lines of code.
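The "query serialized records in place, no warehouse" pattern can be illustrated with only the standard library. Here newline-delimited JSON files on local disk stand in for protobufs on S3, and the field names are invented for the example:

```python
import json
import pathlib
import tempfile

# Stand-in for records dumped from a stream: newline-delimited JSON
# files in a directory (in Sammy's story: protobufs sitting in S3).
tmp = pathlib.Path(tempfile.mkdtemp())
(tmp / "part-0.jsonl").write_text(
    '{"event": "click", "user": "a"}\n{"event": "view", "user": "b"}\n'
)
(tmp / "part-1.jsonl").write_text('{"event": "click", "user": "c"}\n')

# The "four lines": scan files, deserialize, filter — no ingestion step.
records = (json.loads(line)
           for path in sorted(tmp.glob("*.jsonl"))
           for line in path.read_text().splitlines())
clicks = [r["user"] for r in records if r["event"] == "click"]
print(clicks)  # ['a', 'c']
```

With real protobufs the `json.loads` call would become `MyMessage.FromString(...)` against a generated proto class, and the glob would become an S3 listing, but the shape of the query is the same.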
Kostas Pardalis 35:06
Oh, that's super cool. And you mentioned the query optimizer taking into consideration the formats, the types we are working with. Tell us a little bit more about that. How do you build a query optimizer with information related to, say, an image? How does this work?
Sammy Sidhu 35:26
Yeah. So what we found is that there are a lot of simple operations in a query optimizer that give you, like, 90% of the speedups. For example, if you're processing a data frame the way most data scientists do, you add every column you could potentially use and just keep stacking on top of it. So one of the simple things we can do is say, okay, the columns we don't need—let's not actually process them. We can also do things like: if you only need so many rows of data at the end, let's only read that many rows from the very beginning. Doing these operations can drastically speed things up. Where it gets interesting for complex data is that we can factor the heavy computations out of the hot path. For example, if I have multiple tables or data frames with images or audio or something heavyweight, and I want to do a join on something like a key or a timestamp, you actually don't need to shuffle around the really heavyweight data. You can figure out which rows you actually need to keep or emit, compute that first, and only then send over the binary data. So we do operations like these that are much more native to complex data, essentially.
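The join optimization Sammy describes—resolve the lightweight keys first, move the heavy bytes last—can be sketched with plain pandas. This is an illustration of the plan rewrite, not Daft's optimizer:

```python
import pandas as pd

# Left table: join keys plus a heavyweight binary payload per row.
heavy = pd.DataFrame({
    "key": [1, 2, 3, 4],
    "blob": [b"x" * 1000, b"y" * 1000, b"z" * 1000, b"w" * 1000],
})
# Right table: only some keys will survive the join.
keys = pd.DataFrame({"key": [2, 4], "label": ["cat", "dog"]})

# Naive plan: join the heavy table directly, so every blob gets moved.
naive = heavy.merge(keys, on="key")

# Optimized plan: join only the lightweight key columns first, then
# attach the blobs for just the rows that survived.
surviving = heavy[["key"]].merge(keys, on="key")
optimized = surviving.merge(heavy, on="key")

# Same rows come out, but only 2 of 4 blobs ever needed to move.
print(sorted(optimized["key"].tolist()))  # [2, 4]
```

On one machine both plans cost about the same; in a distributed shuffle, where each blob crosses the network, deferring the heavy column is the whole win.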
Kostas Pardalis 36:42
Yeah, that makes a lot of sense. And you mentioned distributed processing at some point—I think you also mentioned Ray. Tell us a little bit more about that. Because, from what I understand, okay, we have the data frame, which is the API, how the user interacts with the data, but then somehow the actual processing needs to happen, right? How does this work with Daft and Eventual? I don't know if there's any differentiation there, but I'd love to hear about it.
Sammy Sidhu 37:14
Yeah, I would break it down into multiple layers. At the top layer, like you mentioned, you have the user API, and this is the user telling Daft what they want to do: I want to select these columns, or run this model on this column. Daft translates that into what we call a query plan, or a logical plan, and this is kind of like a compute graph, if you will, of the operations that are going to happen. The next step is to take this plan and figure out the most optimal way to run it. And finally, once we have that optimized plan, we can schedule it onto a distributed cluster using Ray. Each of these steps, for a given partition or a given subset of data, gets scheduled as an individual function on your cluster. So those are the three layers of Daft. The part that gets really interesting, and what Eventual is working on, is how you should be storing this data in the first place. Daft makes it really easy to query data that's just sitting around in S3, for example, but how should you store it so it's easily accessible, has schemas, and has all the benefits of a data warehouse? That's what Eventual is building alongside it. Daft is an open source tool, and it's really powerful, but the question of how you actually catalog and store that data is the main product of Eventual.
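The bottom layer—each plan step scheduled independently per partition—can be mimicked with the standard library. Here a thread pool stands in for the Ray cluster; the partitioning and the step function are invented for the sketch:

```python
from concurrent.futures import ThreadPoolExecutor

# Partitions of one logical column; each is an independent unit of work.
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

def run_step(partition):
    # One physical-plan step, e.g. "run this function on this column".
    return [x * x for x in partition]

# Schedule the step once per partition. In Daft this dispatch goes to
# Ray tasks across a cluster; a local thread pool shows the same shape.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(run_step, partitions))

# Gather the per-partition outputs back into one logical column.
flat = [x for part in results for x in part]
print(flat)  # [1, 4, 9, 16, 25, 36, 49, 64, 81]
```

The key property is that `run_step` only ever sees one partition, so the same code scales from three lists on a laptop to thousands of partitions on a cluster.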
Kostas Pardalis 38:31
Yeah, that makes total sense. And why Ray? Why did you choose to build on Ray? Why not Spark, for example, or something else?
Sammy Sidhu 38:40
So the main reason—I mean, Ray is pretty low level, which lets us have a lot of control over what we're doing. The second part is, we're very opinionated about not having anything Java-related in our Python. I can't tell you how many weeks of my life I've probably wasted when you get some random error in Python and have to scroll through thousands of lines of Java logs from Spark, in a Spark cluster. And then you lose the Spark logs in, like, CloudWatch or something. And you go through thousands of lines of Java to figure out, oh, I forgot a comma. It's just not a fun experience. So we wanted to build something that's very native, very simple to use, and something where, if a user makes a mistake—which they probably will—it's really easy to debug.
Kostas Pardalis 39:26
Yeah, 100%. I think a big part of the pain that people have with Spark is actually how to debug these things. I have heard many horror stories about what it means to deal with all the stack traces coming from the JVM.
Sammy Sidhu 39:45
I've had friends—I think when we were starting Eventual—who were like, hey, if you could just find a better way to present Spark logs, I'd pay you a lot of money.
Kostas Pardalis 39:59
That's true, yeah. I hear that from people too. I think the worst thing I hear is people saying: we just cannot find the logs. Especially when you are running Spark on EMR, in production cases, it can get extremely painful to do the actual debugging. And it's one thing if you're trying to do that as a Java developer; it's another thing when you're primarily a data engineer writing your code in Python, and then you have to go and figure out what the JVM is doing in there. Yeah, I totally get that.
Sammy Sidhu 40:43
Yeah. And I think the other painful part is, like, not the hair-on-fire stuff, but the other questions: why is my program slow? Why is it not running as fast as it could be? Profiling and knowing what's actually happening under the hood is very hard with these JVM tools that interop with Python.
Kostas Pardalis 41:03
100%. And you also mentioned Rust at some point. So how does Rust fit into the equation here? Because, okay, you were talking about Python and being opinionated about that, but now we also have Rust, right? So yeah, let's go there.
Sammy Sidhu 41:19
So the thing is, when you're dealing with these large amounts of data, unfortunately, Python is not fast enough. So under the hood—the user API is all in Python, you can run Python functions, lambda functions, Python objects—but the stuff that's actually doing the processing is Rust. We cross the boundary from Python into Rust to do all the hard computing. At the top level, the user API is in Python, and our planning happens in Python, but the functions that get called to actually do the number crunching are in Rust. Funnily enough, we started with C++, because that's my background. But there actually aren't that many C++ programmers anymore. So we said, hey, let's make the investment, let's move to Rust. And it's actually been really amazing—I'm really happy we made that move. Our whole core engine is written in Rust, and it makes things very performant, and actually really easy for contributors to jump in and get their hands dirty.
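The split Sammy describes—a Python front end with a compiled engine doing the number crunching—is the same pattern NumPy has used for years, which makes for a runnable analogy. NumPy's backend is C rather than Rust, but the boundary-crossing idea is identical:

```python
import time
import numpy as np

n = 1_000_000
data = list(range(n))

# Pure Python: the interpreter touches every element of the loop.
t0 = time.perf_counter()
py_sum = sum(x * x for x in data)
py_time = time.perf_counter() - t0

# Cross the boundary once, then the tight loop runs entirely in
# compiled code (C for NumPy; Rust for Daft's engine — same idea).
arr = np.asarray(data, dtype=np.int64)
t0 = time.perf_counter()
np_sum = int((arr * arr).sum())
np_time = time.perf_counter() - t0

assert py_sum == np_sum
print(f"compiled backend roughly {py_time / np_time:.0f}x faster here")
```

The design lesson is to cross the language boundary rarely and in bulk: one call that hands over a whole column is cheap, while a per-element round trip would erase the gain.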
Kostas Pardalis 42:15
So, by the way—and this is a bit of a different question from what we've been discussing so far—as a C++ developer going to Rust, what was your experience? And I know it's a bit of a controversial question I'm asking right now, because there are a lot of language wars happening out there, you know: C++ is dying, Rust is the thing, no, Rust is overhyped, and so on. But yeah, how was your experience?
Sammy Sidhu 42:43
Let's see. So I would say—I mean, I love C++, I've been doing it for over a decade. But the thing I really like about Rust is that C++ essentially optimizes for backwards compatibility, which means there are things that never get improved over time. With Rust, when you start building, even if you're new, everything comes with sensible defaults, which I think is very underrated. In C++, you can set things up so it's optimized—so it won't copy, won't do this, won't do that—but you have to set it up, and you need someone with experience on your team to lay that groundwork. Rust comes that way out of the box. And things like the build system and dependency management are also pretty good out of the box. I think just being strong out of the box is so underrated. I feel like you could technically do everything Rust does in C++; it just takes a lot more groundwork.
Kostas Pardalis 43:44
100%. I think there's also a matter of developer experience in moving between the two ecosystems. So how was the experience of interoperating between Python and Rust? I know that, okay, they work pretty well together, but how was your experience with that?
Sammy Sidhu 44:04
Oh, it was just night and day compared to C++. With C++, there are some tools around it, but they're not great. Usually you end up writing C bindings and then using something like pybind, or you use something like Cython to bridge the two—and writing Cython code just sucks. It's not Python, which is not fun, and for the most part it's not what you're used to; it's a weird in-between. With Rust, they have this project called PyO3, and it's been amazing. You just write your Rust code, declare what you want it to be, and it just magically works in Python.
Kostas Pardalis 44:39
Nice, nice. I think we should do another episode at some point going deeper into that stuff, because it's very, very fascinating. I think there are some very interesting lessons there about how to build a good developer experience for these really complicated systems—building compilers and the whole ecosystem around the compiler, because it's not just the compiler itself, the ecosystem is huge too. I think we should do that at some point.
Sammy Sidhu 45:07
What we're seeing in the data ecosystem—it's funny, the whole Python data ecosystem is kind of migrating to Rust. It's kind of cool. If you look at Polars, it's written in Rust as well. And I believe you guys had Bytewax on the show as well; same thing there.
Kostas Pardalis 45:22
100%—like all the stuff Materialize is doing with timely dataflow, for example. Yeah, there's a lot of work getting done right now in data tools using Rust, and it's really fascinating. And you mentioned a couple of different projects there, so I want to ask you: there are many things happening right now, a lot of innovation. We see Polars, for example; there's even stuff like Ibis—I'm never sure how it's pronounced. There are quite a few projects out there that start from the data frame concept, or the pandas concept, and try to build on top of that, right? As a person who is, in a way, doing something similar, tell us a bit more about how you feel about what's going on in the industry right now. What gets you excited? What are you paying attention to, and what would you recommend we pay attention to?
Eric Dodds 46:28
Yeah, I mean, pandas is sticky.
Sammy Sidhu 46:32
Pandas is very sticky. And, you know, I had a hard time understanding why for a while. Then I went to PyData, which is a conference about a lot of the numerical tools within Python, and I went to a talk that was teaching pandas users how to use Python. It occurred to me that there's an entire population of people who know pandas, but not Python, which I had never realized. They were teaching things like, this is a for loop, this is how you make a function—and I was like, wow, I never realized there are people who know a framework within a language but not the language itself. So I think when you build tools that cater to that crowd, you unlock a lot of the data scientists and users who are used to these kinds of tools. That's why I think Ibis is really cool: it gives you an API like pandas, but you can target a backend like BigQuery, or whatever else, without needing to change much code.
Kostas Pardalis 47:42
Yep, yep, 100%. Lots of very interesting projects. Ibis has a crazy amount of support for different backends, which is, yeah—you can go from DuckDB to, I don't know, Trino, or Snowflake, or whatever, and keep using the same code. It's very interesting. Sorry, I interrupted you.
Sammy Sidhu 48:06
Cool. Yeah. I mean, with all of these things, I think the data frame concept is here to stay. I just think what the future data frame looks like—
Kostas Pardalis 48:16
—is still unclear. So what should we be looking at to get a glimpse of that future? Who are the teams—outside of Eventual, obviously—doing interesting stuff in this space?
Sammy Sidhu 48:30
I think there are a lot of cool concepts that we should pay attention to, and that I think are important for the future. One of the things DuckDB is doing, which is fantastic, is letting me just query data without worrying about the format, or where it lives, or anything like that. So the concept of the format really mattering—in Spark you have to be very particular about exactly which version of Parquet you're using, and so on—that concept is dead, right? The next part is being federated. I shouldn't have to ingest my data to query it; I should just be able to give it an S3 path or a Google Cloud Storage path, and it should just work. I think those are must-have concepts for whatever new tool comes up. The thing DuckDB is not, though, is distributed, and I think that's really important, because not everything in the future is going to fit on one machine. For a lot of tabular data that might be the case for some companies, but there are cases where you need to go distributed or handle fault tolerance, and that's one of the things Daft is focusing on. And finally, one of the things I'm really passionate about is making sure these systems integrate well with enterprise tooling. We used JDBC for a long time, and I hate it, and I think most people do too. But nowadays there are new open standards, like the ones Arrow is building, that are really cool, and I'm really looking forward to those as well. Yeah, Voltron—
Kostas Pardalis 49:57
—and that ecosystem, I think, is doing some very interesting things, and has a very interesting amplification effect on this industry. It's very interesting to see what they're building and where they think these things should go, and they have a very strong relationship with their partners. So, yeah.
Sammy Sidhu 50:18
We're of course on Arrow as well, for our data serialization. So I think it's an important tool, and it lets you interoperate really easily.
Kostas Pardalis 50:27
Yeah, yeah. I mean, I think interoperability was always a big issue in the data infrastructure space, and it seems that Arrow is managing to change that. Obviously things didn't change from one day to the next, but it's amazing to see how quickly systems like BigQuery and Snowflake, for example, now speak Arrow. And that says a lot about how important and how powerful the concept is. All right—I want to give some time to Eric as well, because I'm realizing where we are in the conversation. We definitely need to get you back; we have a lot more to talk about. But before I hand the microphone back to Eric: is there something exciting coming up soon that you want to share with us about Eventual and Daft?
Sammy Sidhu 51:21
Yeah, so Daft is doing its 0.1 release—we're fully going into beta. We have a lot of really cool features, including our entirely new core written in Rust, and support for these different data types—images, videos, and others—very natively. We're planning to do a launch at the end of the month, and we'd love for you to check it out. Our website is getdaft.io—check it out and star us on GitHub.
Kostas Pardalis 51:48
Awesome, Eric.
Eric Dodds 51:50
Yes. So I think we have time for one more question—or maybe we'll say one more topic, because I rarely stick to one more question. I wanted to zoom out a little bit and talk about the different ways you currently see, and envision seeing, Eventual and Daft and the related technologies being used. If you think about the obvious ones, even from our conversation: processing imagery in the context of a self-driving car, or the algorithms that provide recommendations in an app like Instagram, which is very image-heavy. But what are some of the other interesting ways you see this being used on complex data? I mean, you mentioned you can operate on audio files and things like that. Those are the ones that make sense to anyone who's listening, but what are some of the less obvious uses that you think will be really influential?
Sammy Sidhu 52:57
It's a good question. So right now we're focusing on the most underserved market, which is people dealing with complex data—things like the images and videos you mentioned. But in the spectrum of complex data, there are things in between that are still underserved. The big example I think of is recommendation systems. If you have something like Facebook trying to decide which post to show you, some of the data they process for that is per-user, and one of the columns might be a list of interactions the user has had, or a list of actions they've taken. That's data that's kind of complex, but not super complex. Yet right now, if you tried to operate on that in existing systems, it would run very slowly. Even nested data like that is actually very slow in existing systems, and that's something we're planning to tackle next.
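The "in-between complex" data Sammy means—a per-user list of interactions—looks like this in pandas, where it again falls back to slow object columns. The user names and fields here are made up for illustration:

```python
import pandas as pd

# Recommendation-style data: one row per user, with a nested list of
# interactions. Not images, but already beyond plain tabular types.
df = pd.DataFrame({
    "user": ["alice", "bob"],
    "interactions": [["click", "like", "share"], ["view"]],
})

# pandas stores the lists as generic Python objects, so every nested
# operation runs as an interpreted per-row loop, not a vectorized one.
print(df["interactions"].dtype)  # object
counts = df["interactions"].map(len)
print(counts.tolist())  # [3, 1]
```

Engines with native nested types (Arrow-backed list columns, for instance) can instead store such a column as one contiguous buffer plus offsets, which is what makes operating on it at scale feasible.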
Eric Dodds 53:50
Yeah, fascinating. All right, well, congratulations on Eventual and Daft—super exciting. We'll have you back on so we can continue to dig into all of our juicy questions. Thank you so much, Sammy. [Sammy: Thank you. Thanks for having me.] Kostas, talking with Sammy—what an incredible story, right? If you're in a Tesla, driving down the highway, and you let go of the wheel, and the car is safely carrying you at 70 miles per hour without you giving the vehicle any input—Sammy is a huge part of why that's possible, because of what he's worked on. And I can't tell you how much I love that he drives an old BMW from the 90s, and it's a stick shift, and he works on it himself. That sort of wonderful story doesn't come along every day, so that was possibly my favorite part of the episode. But I also really appreciated his thoughtful perspective on the problem of dealing with complex data in general. And I was astounded by what came up toward the end of the episode: okay, you have imagery for Instagram and self-driving cars, and obviously that's a huge problem space for complex data—but what else? And he said, actually, just hierarchical data—nested data—is unbelievably slow when you try to work with it at scale. So it's still really early innings for solving problems around complex data. I'm excited to see what Eventual grows into.
Kostas Pardalis 55:39
Yeah, 100%. For me, okay, it was an amazing opportunity to talk with him. First of all, we talked a lot about things that are very interesting to me—and I don't mean just data infrastructure, because, okay, Eventual's vision at the end of the day is to build a new type of data warehouse that can be used by ML people working with non-tabular data. But it was interesting to see how many times, no matter what we were talking about, we ended up talking about developer experience, and how much of a silent problem developer experience can be. I think what he said about the pandas ecosystem was incredible.
Kostas Pardalis 56:38
I really like hearing things like that. Tooling is important—it's the foundation we need if we want to accelerate progress, right? And I really enjoyed talking with Sammy, because he gave some very deep insights on why tooling and developer experience are important, how they can be addressed, and how Eventual is trying to solve them for the use cases they have in mind.
Kostas Pardalis 57:07
Let's get him back on the show. I'm sure we have more to talk about with him.
Eric Dodds 57:13
Indeed. Well, subscribe if you haven't, tell a friend, and we will catch you on the next one. We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That's E-R-I-C at datastackshow.com. This show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
To keep up to date with our future episodes, subscribe to our podcast on Apple, Spotify, Google, or the player of your choice.
Get a monthly newsletter from The Data Stack Show team with a TL;DR of the previous month's shows, a sneak peek at upcoming episodes, and curated links from Eric, John, & show guests. Follow on our Substack below.