Episode 186:

Open Source and the Evolution of Data Systems with Andrew Lamb of InfluxData

April 24, 2024

This week on The Data Stack Show, Eric and Kostas chat with Andrew Lamb, a Staff Engineer at InfluxData. During the episode, Andrew takes us on a deep dive into the intricacies of time series databases and the evolution of data systems. He discusses the specialized challenges of managing high cardinality data and the trade-offs in query performance. The conversation also touches on the development of DataFusion, its adaptation for time series data, and the potential for innovation in the query language space. The episode concludes with a look at the future of data tooling and the exciting possibilities that arise from removing traditional constraints in database architecture, with each person expressing enthusiasm for the role of projects like DataFusion in shaping the landscape of data systems. Don’t miss this episode!


Highlights from this week’s conversation include:

  • The Evolution of Data Systems (0:47)
  • The Role of Open Source Software (2:39)
  • Challenges of Time Series Data (6:38)
  • Architecting InfluxDB (9:34)
  • High Cardinality Concepts (11:36)
  • Trade-Offs in Time Series Databases (15:35)
  • High Cardinality Data (18:24)
  • Evolution to InfluxDB 3.0 (21:06)
  • Modern Data Stack (23:04)
  • Evolution of Database Systems (29:48)
  • InfluxDB Re-Architecture (33:14)
  • Building an Analytic System with Data Fusion (37:33)
  • Challenges of Mapping Time Series Data into Relational Model (44:55)
  • Adoption and Future of Data Fusion (46:51)
  • Externalized Joins and Technical Challenges (51:11)
  • Exciting Opportunities in Data Tooling (55:20)
  • Emergence of New Architectures (56:35)
  • Final thoughts and takeaways (57:47)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.


Eric Dodds 00:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com. Welcome back to The Data Stack Show. We’re here with Andrew Lamb, who is a Staff Engineer at InfluxData, and Andrew, Kostas and I have a million questions for you that we’re going to have to squeeze into an hour. You’ve worked on the guts of some pretty amazing database systems. But before we dig in, just give us a brief background on yourself.

Andrew Lamb 00:45
Hello, yes, thank you. I’m Andrew. Obviously, I’ve worked on lots of low-level database systems. I started my career at Oracle, and I worked on an embedded compiler for a while at a startup. And then I spent six years at a company called Vertica, which built one of the first of that phase of big distributed, shared-nothing, massively parallel databases. I then worked at various machine learning startups, which was fun, but not really related to data stuff so much. And then most recently, I’ve been working with Paul Dix, the co-founder and CTO of InfluxData, on the new storage engine for InfluxDB 3.0. So I’ve been down in the weeds building a new sort of analytic engine focused on time series.

Kostas Pardalis 01:25
And you’ve also been working a lot with DataFusion, right, the open source project. And there are a lot of things we can chat about, as Eric mentioned, but something that I’m really interested to hear from your experience and perspective, Andrew, because you’ve been in this space for a very long time, is that it feels like we are almost at an inflection point when it comes to data systems and how they are built, something that’s probably been happening for a long time. But data systems are complex systems, and very important systems. Risk is not exactly something that people want to take when it comes to their data, right? So the evolution usually is a little bit slower compared to other systems. But it seems like we are reaching that point right now where the way that we build these systems is going to radically change. So I’d love to hear from you: how we got to this point, what are, let’s say, the milestones that are leading to it, and what the future is going to look like, based on what you’ve seen and what you’re seeing out there. So that’s the part that I’m really interested to hear. What about you? What are some things that you would like to talk about?

Andrew Lamb 02:40
I would love to talk about that. And I’d love to talk about the role open source software plays in that evolution and how we got there. That was something that Influx, I think, has always understood and really valued as a company, open source, and how that’s evolving, and also how you leverage open source to build the next generation of products. And I think we can illustrate that with the story of what we’re doing with InfluxDB 3.0 as part of that longer-term trend. And I think there’s lots of interesting things to talk about there.

Kostas Pardalis 03:09
Sounds great. What do you think, Eric? Let’s go and do it.

Eric Dodds 03:12
I’m ready. I’m ready. We need to hit the ground running here, because we have so much to cover. Yeah, let’s do it. Okay, I’d love to just level set on, you know, what Influx is. I think it’s very commonly known as a time series database. But can you just explain what it is and sort of orient us to where it fits in the world of databases? And you know, what do people use it for?

Andrew Lamb 03:35
Absolutely, yeah. So I think InfluxDB is part of a longer-term trend in the database world, where we went from super specialized, sort of monolithic systems, a very small number of very expensive ones, like Oracle or DB2, that kind of stuff, and instead we’ve seen a proliferation of databases that are specialized for the particular use case you’re using them for. And so the time series database as a category is relatively new. And I think the reason you end up with specialized categories is because if you focus on those particular use cases, you can end up building systems that are like 10x better than the general purpose one. And I think that’s approximately the bar you need in order to justify a new system. So time series databases in particular often deal with lots of really high volume denormalized data. So that means you’re basically having to ingest the data that comes in, and they quickly want to be able to query it, like within milliseconds. Not, you know, like when I started in the industry, too many years ago, you did batch loading at night, right? Like you did it once a day, if you were lucky. And now it’s like maybe once an hour, maybe once every five minutes, but in time series it’s measured in milliseconds. So that’s really important, that you’re able to query super fast afterwards. And then I think also, because of this, it’s not part of an application that has been set up beforehand, so you don’t actually know what’s coming from the client. You don’t upfront specify what columns are there. The writes just flow in, and the database has to figure out what columns to add, whatever.
And then of course, with lots of time series data, the most recent stuff is really important and the older stuff is much less important, so you want to make sure your system can efficiently manage a large amount of data, even though only part of it really needs low latency. And there are a huge number of examples, open and closed source, of this type of database. Obviously, Influx is one, but there’s a whole bunch like Graphite and Prometheus, right? There’s a whole bunch of closed source ones at the big internet companies, like Gorilla at Facebook or Monarch at Google, and Datadog has its own version of this, right, that’s proprietary to their own servers. So it’s a whole category, and Influx is the biggest open source one.

Eric Dodds 05:49
So I wanted to get to something you mentioned, this evolution, and we will dig into this specifically, like this evolution of, you know, purpose built databases. But in terms of time series specifically, the need for those, you said, is relatively new. Is that need driven by the opportunity that’s created because there’s technology that can solve some of those problems that were probably difficult to solve before, from a latency standpoint? And then how much is it just the availability and the increase in the volume of data that we are able to collect and observe? How do those two factors influence that somewhat recent importance of time series databases?

Andrew Lamb 06:37
Yeah, I mean, I think you hit it. I mean, you understand, right? Like, as we had more and more systems that operated, you needed to actually be able to observe them and figure out what they were doing, which actually turns out to be a major data challenge itself. But it’s a very different dataset than, like, your bank account records, which is what the traditional transaction system databases were designed for. So I think the workload characteristics are quite different.

Eric Dodds 07:00
Can we walk through just one? I’m going to throw an example out, and you tell me if it’s a good one or a bad one, but I’m just going to throw out a use case that I think would make a lot of sense for InfluxDB, and we’d love for you to speak to, like, what are the unique challenges for that? I’m thinking about IoT devices. So, you know, think about a factory that’s wired up with potentially hundreds or thousands of different types of sensors. You mentioned, actually, that the incoming data can change, and when you think about sensors, installing new sensors, collecting different types of data, that’s probably changing intentionally a lot as you measure different things and install new equipment. So am I thinking correctly about that use case?

Andrew Lamb 07:46
Yeah, no, and that’s another good example of, you know, in a factory, the rate of change of the software you have installed on your robots is probably different than the rate of change of software that’s deployed in some cloud service, right? Like, it doesn’t get deployed every minute, right? Because you got to, like, mess around with the robots. So that’s all the more reason why the data format that comes off those is relatively simple. And the backend to handle it. But yeah, time series is, or IoT is definitely the classic thing. So once again, the idea is like, you dump your data in a time series database. And now you can do things like do predictive analytics, or pull the past history of those machines or the sensors or whatever, and you build models to predict when they’re going to fail, or when they need maintenance, or, you know, variety of other, you know, machine learning problems, which is now of course, the hotness, right, but like to drive any of that stuff, you need data to actually to train the models on.

Eric Dodds 08:37
Yeah, totally, that makes total sense. We’ll switch gears just a little bit. So the use case makes total sense. You’ve spent a lot of time, though, sort of building the guts of this system, right? And in fact, that sort of helped transform the system into the latest iteration of InfluxDB. And I love talking about this, because that’s unique compared to, say, ingesting data, or a dedicated analytics tool, or some of these other things that are maybe moving data or operating on data after it has been loaded into some sort of database system. So what types of things have you had to think about as you’ve architected the system? I’m interested specifically in the unique nature of time series data.

Andrew Lamb 09:33
So maybe I should start by explaining a little bit, since I don’t exactly have an academic database background, but I’ve spent a lot of time studying them. Basically, all the original time series databases have approximately the same architecture, which is that they’re all driven by a data structure called the LSM tree, which is basically, it’s like a key value store, effectively, but it’s a way to quickly ingest data. And so it helps you with a really high volume ingest rate. And it lets you sort of move data off the end, right? As the data gets cold, it’s easy to drop it off the end. But it also requires you to effectively map the data that comes in into, like, a key value store. You’ve got to basically map the data that identifies the time series into a key. And then in order to do any kind of querying, you need to basically be able to map your queries to ranges on that key, which I’m not sure I’m doing a great job explaining, but that’s the high level. And so what that means is that architecture is phenomenal at certain queries. You can ingest data super fast, that’s great. You can also look up individual time series, right? Like if you know exactly the robot you care about, it basically calculates the key range that you care about, and you can immediately fetch it, right? And it’s hard to imagine a better architecture to do that. The trade-off, I think, tends to come in several ways. The first one is, if you want to do a query that’s not just like, I already know which robot I care about, right, like I want to go do something across all the robots, then it’s typically very challenging, because you basically have to walk through all the data to do that.
Likewise, if you have a large number of individual things, like lots of little tiny robots, or like containers, or something, which we would call high cardinality, in that time series use case, that’s typically very challenging with the traditional architecture, because you have the same problem: the index you need to figure out exactly where to look in the LSM tree becomes huge and overbearing.
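To make the key-mapping idea concrete, here’s a minimal Python sketch of the classic layout Andrew describes: each point is keyed by (series key, timestamp), keys are kept sorted so one series is contiguous, and a cross-series query has to walk everything. This is a toy, not InfluxDB’s actual implementation; the series keys, the key encoding, and the store itself are all invented for illustration.

```python
import bisect

# Toy sketch of the classic time series layout: every point keyed by
# (series_key, timestamp), kept in sorted key order so one series'
# points sit next to each other, like an LSM tree's sorted runs.
class ToyTimeSeriesStore:
    def __init__(self):
        self._data = []  # sorted list of ((series_key, ts), value)

    def write(self, series_key, ts, value):
        # Sorted insert keeps all points of one series adjacent in key order.
        bisect.insort(self._data, ((series_key, ts), value))

    def query_series(self, series_key, ts_start, ts_end):
        # Fast path: binary-search to the key range, then read contiguously.
        lo = bisect.bisect_left(self._data, ((series_key, ts_start), float("-inf")))
        hi = bisect.bisect_right(self._data, ((series_key, ts_end), float("inf")))
        return [(k[1], v) for k, v in self._data[lo:hi]]

    def query_all_series(self, predicate):
        # Slow path: a query across *all* series gets no help from the
        # key order and must walk every entry.
        return [(k, v) for k, v in self._data if predicate(v)]

store = ToyTimeSeriesStore()
store.write("robot-1,floor-2", 100, 3.5)
store.write("robot-2,floor-1", 100, 9.9)
store.write("robot-1,floor-2", 110, 3.7)

# One series, one key range: a binary search plus a contiguous read.
print(store.query_series("robot-1,floor-2", 100, 120))  # [(100, 3.5), (110, 3.7)]
# Across all robots: the whole store is scanned.
print(store.query_all_series(lambda v: v > 5.0))
```

The same shape explains why dropping cold data is cheap in these systems: old timestamps sit at one end of the key order and can be truncated wholesale.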

Kostas Pardalis 11:36
Andrew, can you expand a little bit more on the high cardinality? That’s, I think it’s one of the things that comes up very often when we’re talking about time series data. I think people have a concept of what cardinality is and what high cardinality would mean. But what does it specifically mean when it comes to time series databases and why is it so important?

Andrew Lamb 12:03
Yeah, so cardinality just means the number of distinct values, right? That’s the academic term. So even if you have a million rows, if there’s only seven distinct values, like, for example, you only have data in seven distinct AWS regions, well, that’s a different data shape than if you’ve got a million rows and they’ve all got a million individual values, like unique IDs or something. And as you might imagine, the way you store those is different. The reason it’s particularly important for time series is that the reason it’s called a time series database is there’s a notion of a time series, which tends to be a set of timestamps and then measurements of something, values at those times. But it’s not just values and times that kind of hang out in the ether. They’re attached to something which identifies where they came from. Like, what is a measurement? It’s the measurement of the pressure of the robot over time or something, right? But not just the robot, like this particular robot in this factory on this floor, or whatever. And so the classic time series databases are organized so that they store all the data for a time series together, and the queries ask for individual series, right? So like, I want a dashboard that shows me, for this robot, what its current pressure value is. In order to build a database that answers that kind of query really quickly, you need some kind of way to quickly figure out, from the robot identifier or whatever you have, where it’s stored in the physical structures and go find it. And as I was saying, the classic way to do that is with some sort of tree structure, where you basically have ranges, right? The keys look something like the robot ID and the timestamp.
And so then if you want to do a range query, tell me something about the robot for the last 15 days, the keys that have that data are all next to each other in the key space, and so your database can just go look up one place and read all the data contiguously. So that’s great. Now, you asked, what’s the problem with high cardinality? I’ve been hand waving and saying there’s some index structure that lets you quickly figure out where in the keyspace that particular robot is. In Influx it’s a thing called the TSI index. Other systems have the same notion, it’s just called something different. But it basically lets you quickly figure out where in the key range the thing you care about is. And the problem is, typically those indexes are the same order of magnitude in size as the number of distinct values in your key. So if you have 1,000 robots, you’ve got 1,000 entries in this file, and that’s fine. But if you’ve got a million robots, or 10 million robots, or 100 million robots, the index size just keeps getting bigger. And so then managing the index becomes the problem. And that’s exactly why most time series databases will tell you not to put high cardinality data into them. So it’s not that they can’t deal with large numbers of values; it’s that they can’t deal with large numbers of individual identifiers for the time series themselves.
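A toy illustration of the index-size problem Andrew describes. This is not the actual TSI index, just a hypothetical dict-based sketch; the point is that the number of index entries tracks the number of distinct series (tag combinations), not the number of rows.

```python
# Toy series index (not Influx's TSI implementation): one entry per
# distinct tag combination, mapping it to where its data lives.
def index_entries(rows):
    """Build a toy series index: distinct tag-set -> slot identifier."""
    index = {}
    for tags, _ts, _value in rows:
        index.setdefault(tags, len(index))  # one entry per distinct series
    return index

# Low cardinality: 1000 rows spread across 7 regions -> 7 index entries.
low = [((f"region-{i % 7}",), t, 0.0) for i, t in enumerate(range(1000))]
print(len(index_entries(low)))   # 7

# High cardinality: every row tagged with a unique container ID ->
# the index grows as fast as the data itself.
high = [((f"container-{i}",), t, 0.0) for i, t in enumerate(range(1000))]
print(len(index_entries(high)))  # 1000
```

This is why the standard advice is to tag series with a bounded identifier (a service name) rather than an unbounded one (a container ID).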

Kostas Pardalis 15:06
And I would assume there is always some trade-off there, right? That’s why we have these problems; we always have to trade something for something else. So what would we have to give up, let’s say, in a time series database, if we would like to also support high cardinality? What would the trade-off be?

Andrew Lamb 15:35
Right. I think what you have to trade off is the absolute minimum latency to request a single time series. The way the traditional time series databases are structured, it’s hard for me to imagine some way that you could do better if what you want is just the particular time series for some particular range. You’ve got that basically in some memory structure, and you’ve got a single place you go look at to figure out what range you need to scan. That’s pretty fast. And maybe with engineering you can eke some more percentage out of that, but that’s not some fundamental architectural thing. You look up where you want, it’s all one contiguous thing, and you just go find it. So that’s the thing they’re phenomenal at, and that’s what you’re going to have to trade off: that super low latency. I think you can still be very good, and we can talk all about this, but I don’t think you’re ever going to be able to do quite as well.

Eric Dodds 16:33
Yeah. And so high cardinality really has an impact on latency, right? As you’re operating over the data. That’s the trade-off for the end user.

Andrew Lamb 16:48
The trade-off for the users, typically, is if you try to put high cardinality data into these systems, it will tend to cause real trouble. Like, the maintenance of the index will just basically dominate. The other alternative is you don’t put that index in, but instead, every query, instead of knowing exactly where to look, has to look at a much broader swath, which is more work on every query. Yep, yep. Sorry to interrupt, Kostas. No, go ahead.

Kostas Pardalis 17:16
I interrupted you in the first place. So please,

Eric Dodds 17:19
I don’t know, though, you were asking some questions. So actually, just to continue on that: when we think about this trade-off, especially as it relates to cardinality, you don’t necessarily always have control over the actual nature of your, I guess, physical data, if you want to say that, right? So what do you do when you talk about the best uses for a time series database? Is that limiting it to time series data that has lower cardinality? If I’m choosing a specific tool to manage time series data, how do I need to think about where to employ it, and where cardinality issues might cause problems for me?

Andrew Lamb 18:13
So just to be clear, I want to start by saying I haven’t been down in the trenches deploying time series databases for people, so I can give you a high level answer that is not born out of hands-on experience. So I think, typically, where does high cardinality data come from, right? If what you’re doing is tracking servers in your server farm, you only have so many servers, and probably most time series databases are fine. But if what you’re tracking is individual Kubernetes containers, right, of which every microservice has, you know, many, and they continue to get created and destroyed, that starts looking like a high cardinality column really fast. And so I think, if you want to use one of these sort of traditional time series databases, you have to be careful when you are deciding what to send into the database, what to store, and how to identify the things you care about. You know, probably you don’t send the individual container ID, right? Like you probably have some sort of service identifier instead, which is low cardinality. So it’s not that you can’t use a traditional time series database for those use cases, but it means that you have to be much more careful when you stick data in that you don’t inadvertently include a high cardinality column and something that’s going to cause trouble. And plenty of people are able to make that kind of change.

Eric Dodds 19:29
One other question that I have based on this, this is great, by the way, I don’t know if we’ve talked about time series on the show in quite some time, but you talked about cold data. You talked about the use for time series data being the need to query that data in sort of an immediate way, the queries needing to be more immediate, if you will. In terms of thinking about when data goes cold, or when you need to offload it, that’s obviously a decision for the end user, to decide what their thresholds are for that. But generally, if we go back to the IoT device question, in that case, what would you see for InfluxDB users? When are they offloading the old data? Or are there use cases around historical analytics on it, as opposed to getting a ping for potential device failure because an ML model runs on the data in near real time?

Andrew Lamb 20:31
So I think you’ve hit it on the head there. Well, let’s take a step back. So in the traditional architecture, you have a time series database that ingests the data quickly and keeps it available for some window of time, typically seven days or 30 days, either of those are common, maybe two months, but probably not that much longer. And then if you want to do longer term analytic queries, you tend to have to have a second system, right? So you direct the data to both the time series database and this other system, which is probably some big distributed analytic database or something, right, just doing classic look-back analytics, which is much less latency sensitive. So obviously, one mission of InfluxDB 3.0 is, you know, it’s really inconvenient to have two systems, right? It would be better to have one system, or even better, a system where you, the developer, don’t have to direct the data streams into two places. You only have one ingest pipeline, and maybe I’ll talk a little bit about the new version later, but you direct it into your time series database and it eventually ends up as Parquet files on object store. You still have low latency when you need it, and your existing analytic systems, which would probably also deal with Parquet on object store these days, sort of have a unified view. 
The other thing that I wanted to mention about the cost thing is, I think most customers aren’t like, oh, I only want seven days of data, right? They want more. They’re typically driven by the cost, right? They just have a certain amount of cost or capacity. And the current time series technology required basically fast disks and large amounts of memory, and so that tends to limit the scope. But I think there was a real need for, hey, it doesn’t make sense, right? Some of your data is really important to be hot, but I’d love to keep three years worth of data, and my expectations on query latency for that are much lower. It’s okay if it takes longer than the stuff that came in 30 minutes ago. Being able to do that in a cost effective way was another driver for InfluxDB 3.0.

Eric Dodds 22:46
Yeah, I was just gonna say, I want to dig into Influx 3.0. Before we get there, maybe this is a good bridge, but can we zoom out just a little bit? Because you mentioned, you know, having to do those in two separate systems, and there’s a lot of overhead involved in that. But zooming out just a little bit, that really gets at what you said at the beginning, right, where there’s this trend of sort of breaking apart the monolithic systems into these various database systems. There’s tons of really good open source tooling out there today, sort of enterprise grade, I guess you could say. How do you think about architecting a system where you have multiple databases? I mean, in some sense, having a giant monolith certainly decreases the complexity, even though there are tons of trade-offs there. We could call it a modern data stack, we could call it sort of compartmentalizing these pieces using best of breed tooling, but help us think through that.

Andrew Lamb 23:50
So traditionally, right, the database was tightly bound to the storage system. The database was the source of data. And so if you wanted multiple databases in your system, you’d actually have to orchestrate the data flow between them, right? That’s what ETL is all about, right? You take the data from somewhere, you extract it, transform it, you put it somewhere else. And as you pointed out, the problem with doing that, now that you want specialized tools for the specialized job, is you have copies of your data and data pipelines everywhere, and it’s not just ingest, but then once it’s inside, getting it to wherever you need it. So I think what’s happened over the last five years, and it’s only going to happen more, is that the source of truth for analytics specifically is moving to Parquet files on object stores, like S3. We had Ryan Blue from Tabular here the other week, right? He’s solving another problem that comes out of putting a whole bunch of Parquet files on object storage. But the economics and compatibility of having your data as Parquet files on object stores, I think, are compelling enough that that’s where everyone’s going to move. It’s basically, you know, like having an infinite FTP server in the sky. Sorry, I’m dating myself, right? But S3 is basically, you know, it’s like the internet. It is super cheap. Oftentimes we’ll have terabytes of stuff there, like, oh, we probably don’t even need that, but it’s fine, right? And so I think that’s going to become the source of truth. And then what you’re going to have is a bunch of specialized engines that are specialized for whatever your particular use case is, that will read data out of there and potentially write it back. And obviously, InfluxDB 3.0 is one of those.
But I think, basically, any database system that was architected relatively recently, for analytics especially, that’s the trend, and I think it’s only going to accelerate. And I think it’s really good, because now you have data in one place, maybe depending on how you feel about Amazon or the other cloud providers. But your tools will now be able to talk to it in the same place, rather than having to have endpoints with connections, right, like daisy-

Eric Dodds 25:54
chaining all these different systems. Yeah. Kostas, I know you have a ton of questions around this architecture and have actually thought a ton about it yourself, so we’d love to hear your thoughts as well on the market moving towards this.

Kostas Pardalis 26:06
100%. I think it’s kind of like a natural evolution. Think about the way we were building applications 15-20 years ago, right? The monolithic way of doing it in the early 2000s would never be able to scale to where the industry is today, where the market is today. So we had to become much more modular. In the beginning we were talking about, okay, we have the back end, the front end, and that evolved into much more granular architectures in the end. That’s for applications, and we kind of need something similar for databases, too. If we want to build products around data, we can’t afford, you know, what they used to say, that to build a database company you need like 10 years just to get to market, because you have to build first, and in 10 years it’s a completely different world out there. That’s why we have so many database companies that died: people try to go and do that, and by the time you’re ready to go out, you just cannot go to market anymore, right? So if we want to have a similar way of building and creating value on top of data, we need to replicate the kinds of patterns that software engineering applied to other systems, but now do it for data. At the end, it doesn’t matter if it’s a distributed processing system or whatever; they all share the same, let’s say, principles behind how they are architected. So it’s inevitable to happen. And whatever we talk about, like AI or whatever is going to happen in the next 10 years, without strong foundations from the infrastructure that manages and exposes this data, nothing will happen.
That's just how this industry works, how the world works in the end. It's like physics, right? So I think it's inevitable; the questions are how fast it will happen and in what direction. That's what I'd like to talk about with Andrew, because Andrew brings a very unique experience here, in my opinion. First, because he's been building database systems for a very long time, so he has seen everything from the old monoliths to whatever the state of the art is today, but he has also experienced what it takes to leave the monolith behind and build something that adopts these new patterns. So, Andrew, I'd like to ask you first: what do you think are the main milestones of the past 10 or 15 years that got us where we are today, if you have to pick two or three? You already mentioned one, the separation of storage and compute, which I think is what started everything in a way. But there are others too. So tell us, in your mind, what would you put as the top three most important changes?

Andrew Lamb 29:55
Yeah, well, take what I say with a grain of salt, because it's very focused on analytics and it's all colored by my background. But I think the first thing that was really important was that in the 2000s, parallel databases actually became a thing — they stopped requiring crazy hardware. I don't know if you guys remember, but earlier, if you wanted to run on even four cores, you had to buy some fancy specialized hardware, like a Tandem machine, to actually run your database on multiple cores. It was crazy. And then there was a raft of companies, including Vertica, that were part of the wave that went from single-node machines to actual commercial offerings that were parallel databases. So that was the first one: we actually figured out parallel databases. The second one was columnar storage for analytics — again, often a lot of the same companies. And it sounds stupid, right? Traditional databases store stuff in rows, now we're going to store stuff in columns, and that's going to be earth-shattering? It doesn't sound all that amazing. But it ends up being really important when your workload shifts away from write-heavy transaction systems — lots of little transactions, each touching a small amount of the data — to analytics, really analyzing large amounts of data, where the workload pattern is very different: many fewer queries per second, but each one touches a much broader swath of the data. Organizing your data in columns, as a column store, makes that significantly faster and much more efficient.
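The row-versus-column point Andrew makes can be made concrete with a tiny sketch (illustrative Python, not actual database code): an analytic query like a SUM over one column only needs the contiguous array for that column in a column store, while a row store drags every field of every record through memory.

```python
# Illustrative sketch: the same table stored row-wise and column-wise,
# and an analytic query that touches only one column.

rows = [  # row-oriented: each record stored together (good for OLTP)
    {"user": "a", "region": "us", "amount": 10.0},
    {"user": "b", "region": "eu", "amount": 25.0},
    {"user": "c", "region": "us", "amount": 7.5},
]

columns = {  # column-oriented: each column stored contiguously (good for OLAP)
    "user": ["a", "b", "c"],
    "region": ["us", "eu", "us"],
    "amount": [10.0, 25.0, 7.5],
}

# "SELECT SUM(amount)": the row store must touch every field of every
# record, while the column store reads only one contiguous array — which
# also compresses and vectorizes far better on real hardware.
total_rows = sum(r["amount"] for r in rows)
total_cols = sum(columns["amount"])
assert total_rows == total_cols == 42.5
```

The asymmetry is reversed for transactional workloads: inserting or reading one whole record is cheap in the row layout and touches every array in the column layout, which is exactly the trade-off Andrew describes.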
And then the third one, since I get three, is the separation of storage and compute, like you said, which in database terms is often referred to as a disaggregated database. The distinction is that Vertica and similar systems — ParAccel, the early versions of Redshift — were what's called shared-nothing architectures. That basically meant you had compute nodes with disks attached; the distributed database was effectively sharded internally, and individual nodes were responsible for individual parts of the data. That's great because you're guaranteed to have co-located compute. The problem is that if you ever wanted elasticity — to change the resources you gave the system — you had to move all the data too, because the data was tightly bound to the compute. That was definitely challenging; Vertica was dealing with that scalability problem, and I think that's why disaggregated databases became a thing. Snowflake now seems obvious, but I remember reading the Snowflake paper — they wrote a paper about their architecture back in the day — and it sounded crazy. We were building distributed databases for these really high-end, enterprise-grade machines, and they said: we're going to build a database on a bunch of crappy VMs from who knows where, and we're going to build it on top of this object storage thing that's hundreds of times slower than the fast disks we're using in the shared-nothing architecture. Looking back on it, it was clearly prescient — they were basically the first major commercial success with that architecture.
It just seems obvious now, but I think it was really revolutionary at the time. Yeah, it sounded like a crazy idea — how are you ever going to build a database like that?
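The elasticity problem with shared-nothing designs can be sketched with a toy hash-partitioning example (illustrative only — real systems use more sophisticated sharding, and consistent hashing can reduce, though not eliminate, the movement): changing the node count remaps, and therefore forces moving, most of the data, which is exactly what a disaggregated design on shared object storage avoids.

```python
# Illustrative sketch: in a shared-nothing design each node owns a hash
# partition of the data, so resizing the cluster changes the key->node
# mapping for most keys, and all of that data must physically move.

def owner(key: str, num_nodes: int) -> int:
    # naive hash partitioning: key -> node index
    return sum(key.encode()) % num_nodes

keys = [f"series-{i}" for i in range(1000)]

before = {k: owner(k, 4) for k in keys}  # cluster of 4 nodes
after = {k: owner(k, 5) for k in keys}   # scale out to 5 nodes

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved} of {len(keys)} keys changed owner")  # the large majority
```

In a disaggregated design the data stays put on object storage, so adding or removing compute nodes changes only which node reads which files, not where the bytes live.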

Kostas Pardalis 33:14
Okay, so let's talk a little bit about the evolution of Influx, because from what I know, you started with a different core technology, and at some point you re-architected and moved to different technologies — which will also get us talking about DataFusion. So tell us a little bit of the story. First of all, what triggered the need at Influx to investigate moving to something else? And what in the end made you move to using something like DataFusion?

Andrew Lamb 33:55
Yeah. So I can talk a little bit about the motivation at Influx — Paul is probably the best person to talk about it, but he's spoken about it publicly, and it's not surprising when you think about it. A bunch of the challenges we talked about earlier with time series databases — high cardinality, longer retention intervals, doing more analytic queries — all of those were long-term asks from InfluxData customers. So Paul was looking for technology that would basically allow us to satisfy them. And I also think, as I understand it, the market was effectively becoming commoditized. Prometheus came out, along with a whole bunch of other open source products that are basically Influx clones — I'm sure everyone will take offense at that, and I'm not trying to offend, but the high-level architecture is the same: you've got this in-memory tree, you have some index, and maybe one is better than another, but at the core it's all getting commoditized. So I think Paul was looking for how to bring the architecture to the next level. And to your point earlier, building a database, as he knows because he just did it for eight or ten years, is a long-term endeavor — much longer than you ever want — and it costs hundreds of millions of dollars. I have some slides somewhere where I go through this: pick your database company and look at how much money they raised, and it's probably hundreds of millions of dollars. Not all of that money gets spent on engineering, of course, but a substantial amount does. So I think when Paul was casting around — "Well, how am I going to build a database engine?" — the only practical way to do it is to find the things you can reuse rather than build it all from scratch.
And I think he saw that early on. Even in 2020 he wrote a blog post saying Apache Arrow, Parquet, DataFusion, and Arrow Flight are game changers for implementing databases. At that point I was stuck in engineering-management nonsense, so when Paul said, "Hey, these are some great technologies," I said: as long as you let me code, I'll do it in Rust, I'll use whatever technologies you want, I just need to get back into engineering. But I think he was very prescient — he identified those early on, and then invested to help drive those ecosystems forward. And the reason they're the core technologies is that if you're going to build an analytic system now, it's been done for twenty years: we've had two or three waves of commercialization, built after ten years of industrial and academic research. People have built the same thing over and over again, and I think we're finally at the point where the patterns are clear, so you don't need to rebuild it again. There are things that are basically common to every OLAP engine — Vertica has them, Hadapt, Snowflake, DuckDB, DataFusion: you need some way to represent data in columns, both in memory and on disk, because that's so much more efficient for analytics workloads. Apache Arrow is basically the way to store data in columns in memory, with ways to encode it, and Parquet is basically the same thing for disk. I'm obviously skipping a bunch of details, but between those two technologies you have the in-memory processing component and the storage component.
And then when you build these database systems, you typically need a network component, because it's either a distributed system or you've got a client and a server, so you need some way to send data over the network — that's what Arrow Flight is. And then, building something that can run SQL or some other query language is a non-trivial endeavor, just like building a programming language like C++ is a non-trivial endeavor. But you can still do it these days, because you don't have to build the whole thing — the front end, the intermediate representation, all the optimizations, the code generator — because LLVM is a technology for compilers that does all of that. I think that's why you ended up with languages like Rust or Swift or Julia: the reason they could even be made is that they didn't have to reinvent all the lower-level stuff. We've gotten to the same point with SQL engines: you don't need to reinvent the whole SQL engine from the bottom up, because we understand how to do it and the patterns are there. Arrow is basically what you want in memory; Parquet — obviously there are things about Parquet you could improve, but it's pretty good. And with those building blocks, DataFusion is the Apache project for the query engine part: it will let you run SQL queries and a whole bunch of other stuff, which we can talk about at length, but I don't want to take over the whole talk.

Kostas Pardalis 38:51
Yeah, so would you say that DataFusion is the equivalent of LLVM, but for database systems?

Andrew Lamb 38:57
Yes. That’s how I like to think

Kostas Pardalis 38:58
about it. Okay, I think that's a great metaphor for giving a high-level view of what DataFusion is trying to do. All right, next question. I understand the need to go from pure time series workloads to more analytical capabilities, so it makes sense to look into a system built primarily for analytical workloads. And DataFusion, if I'm not wrong, was started with exactly that in mind — its creators didn't have time series data in mind. Which means these things somehow need to be bridged, right? You still need to work with time series data; you can't abandon what you were doing. So how do you take DataFusion, which is a system for analytical workloads, and keep the core of a time series database that does everything other time series databases do and have to do, but at the same time also has analytical capabilities?

Andrew Lamb 40:16
Yeah, you're absolutely right that there's a bunch of time-series-specific stuff that's not in DataFusion. In some ways that's a great story, because you get DataFusion basically for free — you spend a little bit of time, but DataFusion is there and you work with a bunch of people to make it better — and then the vast majority of engineering time goes into the time-series-specific things. There's a lot of effort that went into the component that ingests data really quickly: taking this line protocol, CSV-ish kind of stuff, parsing it fast, turning it into the in-memory format, handling persistence, getting it into Parquet files quickly. And there's a system in the background to merge those files together, and so on. A huge amount of engineering effort went into that, plus the systems aspects of knowing who has what data and where it is. So Influx spent a lot of its engineering time on those time-series-specific features, and as long as DataFusion is fast enough, it works. It's kind of cool: early versions of our InfluxDB 3.0 product only did SQL, because we just ran queries through DataFusion. However, InfluxDB also has a specialized query language called InfluxQL — if you've ever worked with time series, you'll know SQL is miserable for a lot of those types of queries, so InfluxQL is a much, much nicer query language for them. That's another example of something very time-series-specific.
So InfluxDB 3.0 supports both SQL and InfluxQL, but we don't have a whole separate query engine for InfluxQL — there's just a front end that translates InfluxQL into the same intermediate representation DataFusion uses, called logical plans, and then the rest is shared with the SQL engine. So those are some specific examples: the time series database didn't just magically come out of DataFusion, we had to build the time series pieces around it. But we didn't have to go build the basic vectorized analytic engine — and if you want the list of things you need there: it needs to be fast, it needs to be multi-threaded, it needs to be streaming, it needs to aggregate really fast, it has to handle multiple columns and single columns, all the different data types, the five different ways of representing intervals, and so on.
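The idea of two query-language front ends sharing one engine can be sketched like this (a toy illustration with made-up parsers and a made-up plan format — not the actual DataFusion API or InfluxDB's InfluxQL translator): each front end lowers its own surface syntax to the same logical plan, and everything below the plan is shared.

```python
# Illustrative sketch: two front ends, one shared logical plan, one engine.

def sql_frontend(query: str) -> dict:
    # pretend parser for "SELECT <col> FROM <table>"
    _, col, _, table = query.split()
    return {"op": "projection", "column": col,
            "input": {"op": "scan", "table": table}}

def influxql_frontend(query: str) -> dict:
    # pretend parser for a different surface syntax; same plan comes out
    _, col, _, table = query.split()
    return {"op": "projection", "column": col.strip('"'),
            "input": {"op": "scan", "table": table}}

def execute(plan: dict, tables: dict) -> list:
    # the shared "engine": interprets logical plans, whoever produced them
    if plan["op"] == "scan":
        return tables[plan["table"]]
    if plan["op"] == "projection":
        return [row[plan["column"]] for row in execute(plan["input"], tables)]
    raise ValueError(plan["op"])

tables = {"cpu": [{"usage": 0.5}, {"usage": 0.9}]}
plan_a = sql_frontend("SELECT usage FROM cpu")
plan_b = influxql_frontend('SELECT "usage" FROM cpu')
assert execute(plan_a, tables) == execute(plan_b, tables) == [0.5, 0.9]
```

The payoff is the one Andrew describes: adding a new language means writing only a parser and a lowering step, while the optimizer and execution engine are written once.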

Kostas Pardalis 42:43
Yep, it makes total sense. In my opinion, that's also one of the traps people fall into when building new data systems: SQL is the standard, but at the same time it's not great for many things, and when you get into specialized use cases you go, okay, let's reinvent and build a new query language. What's great about what you're saying about DataFusion is that now it's much easier for someone to say: I can support standard SQL, which makes it easy for anyone to come and work without the heavy lifting of learning a new language, but at the same time I can give my product its own dialect, or new syntactic sugar on top of the existing SQL, without an overly complicated system underneath. The query language is the equivalent of the user interface, in a way — for a GUI application it's the graphical interface, but in our world it's the syntax you provide to the user. And that caused big problems in the past because of the monolithic architecture: adding something new meant going through the whole monolith, so it was a big no. A lot of the value for the user, for the developer experience, comes from exactly that, and what Influx does is an excellent example: you have SQL, and you also have the specialized syntax that makes it easy to work with time series data and be more efficient and productive. Which is amazing, right?
One more question on that: do you have an example, even an anecdote, of where DataFusion's focus on analytical workloads made your life harder when you had to work with time series data — if there was something like that?

Andrew Lamb 44:55
I mean, the first major challenge is the time series data model, at least as embodied by what I'll call the line protocol, which is the InfluxData version of it. Every system I know about that does time series, especially for metrics, has basically the same data model — they name things differently, but it's basically the same: you have some identifier, a string, that identifies your time series, and then you have a value and a timestamp. And DataFusion is very much a relational engine, and so is Apache Arrow — yes, they have structured types, but at the end of the day it's basically tables of rows. So the very first challenge is that you have to map the Influx data model back into a relational model. We did, but it's definitely non-trivial. There are nulls, new columns can appear, you have to deal with backfill, with what the update semantics are, and stuff like that — there's quite a lot of logic we had to work out, as opposed to just having Parquet files on disk that we run queries over directly. We had to do a lot of work to figure out exactly how to map it all into the relational model.
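A rough sketch of that mapping (simplified and hypothetical — real line protocol has nanosecond timestamps, typed field values, and escaping rules this ignores): tags and fields become relational columns, and because different points can carry different fields, the resulting table is sparse, with NULLs wherever a column is absent — one of the exact wrinkles Andrew mentions.

```python
# Illustrative sketch of mapping line-protocol-style points into
# relational rows. Not InfluxDB's actual parser.

def parse_line(line: str) -> dict:
    # simplified line protocol: measurement,tag=val field=val[,field=val] ts
    head, fields, ts = line.split(" ")
    measurement, *tags = head.split(",")
    row = {"time": int(ts), "_measurement": measurement}
    for tag in tags:                       # tags -> string columns
        k, v = tag.split("=")
        row[k] = v
    for field in fields.split(","):        # fields -> numeric columns
        k, v = field.split("=")
        row[k] = float(v)
    return row

lines = [
    "cpu,host=a usage=0.5 1000",
    "cpu,host=b usage=0.9,temp=55.0 2000",  # extra field => a new column
]
rows = [parse_line(l) for l in lines]

# Relational view: the union of all columns, None (NULL) where absent.
columns = sorted({k for r in rows for k in r})
table = [[r.get(c) for c in columns] for r in rows]
assert "temp" in columns and table[0][columns.index("temp")] is None
```

Even this toy shows why schema-on-write gets tricky: every new field widens the table, so the engine has to handle sparse columns, evolving schemas, and merge semantics that a static Parquet file never forces you to think about.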

Kostas Pardalis 46:09
Yeah, that's very interesting — probably a topic on its own for a full episode. But let's switch gears a little bit, because I want to talk more about DataFusion itself, the current state of the project, and what's next. You engaged with DataFusion early, and it's at a pretty mature point right now, I'd guess, with many people using it and many contributors giving back. A system like LLVM takes hundreds or thousands of engineering years to build and maintain, and that's probably true for something like DataFusion too. So what's next for the project? Which parts of the project do you see needing more love, and where are the opportunities, in your opinion?

Andrew Lamb 47:17
Yeah, let me talk about the technical challenges and opportunities in a second — first I just want to look at where the project is at a project level. I think we're now at the point where there are early adopters, like Influx, and a couple of other early-adopter companies that built products on top of DataFusion — InfluxDB is one of them, obviously, and there are others like Greptime and Coralogix. The early adopters have maybe been doing this for four years. Then there are people who in the last year or two decided to build products on top of it — a whole new raft, like Synnada and Arroyo, and probably a whole bunch of others I don't know about. There's a group of startups now starting there because they want to build some cool analytic thing, and they realize that building a whole new SQL engine is probably not where they can afford to spend their time. So that's where the adoption trend is, and I think we're just going to see more and more of it as it becomes obvious to people: it's an open source project, and I can use it. And the project has been part of the Apache Arrow governance structure, but we're in the final phases — I expect it to be finalized in the next week or two — of it becoming its own Apache top-level project. For those of you who aren't deep in the guts of Apache governance, which you don't need to be, that's basically a recognition that the community is now big enough and self-sustaining enough that it needs its own place rather than living inside Arrow. It was wonderful to be incubated there.
But it will be its own project very soon, and I think that will help drive its growth as well. So now, technically: from my perspective, DataFusion has basically very competitive performance. We just wrote a paper for SIGMOD, if anyone cares, and in our results, if you look at querying Parquet directly, we're basically better than DuckDB in some cases and not as good in others — roughly neck and neck. That's actually pretty good: the DuckDB folks are very smart, and they're basically the state of the art in embedded OLAP engines these days. So I feel very good about the core performance. And it has the basics. If you're going to build an analytic system today, you need basic SQL, and there are places that are hard to get right that we've already done. You need the basics of SQL; you need all the timestamp functions, which, if you've never built a database, doesn't seem like it should be a big deal — but actually getting time and time arithmetic right is a major undertaking, and we've done that in DataFusion. Implementing window functions is another one that takes a long time — you might not even appreciate that those are things in SQL you have to implement. It has that. And then you need fast aggregation, fast sorting, and fast joins — and not just fast: you need to be able to handle those three major operations when your data doesn't fit in memory. DataFusion does all of that, except the one thing it doesn't do: it doesn't externalize joins yet.
So if you want to build a system on top of DataFusion to compete in the complicated enterprise market — some enterprise data warehouse thing where they've got 300 joins in their queries — I'm not sure DataFusion will do a great job. Actually, having spent four years of my life on the Vertica optimizer, I'm not really sure any database system can do a great job on a query with 300 joins. DataFusion has a fine optimizer — it does fine — but it's not going to work well with hundreds of joins, and it doesn't have externalized joins.

Kostas Pardalis 51:03
Can you expand a little more on that? Because our audience might not know what externalizing joins even means.

Andrew Lamb 51:11
Yes. What this really means is that you can do these operations even when they don't fit in memory. Let me start with something simpler than joins — grouping. Say you're grouping on all your distinct user IDs, calculating who clicked the most or how many dollars each spent, and you have a huge number of individual user IDs. The way a database typically runs a query like that is: you read the rows, figure out the group key, and keep a hash table tracking the current values for each of the different users. As the data flows in, you find the right place in the hash table and update it as you go. That's called hash aggregation, and it's basically what they all do — making it quick is actually another whole fascinating discussion, but I'm just talking about externalizing right now. The problem is that if you have a huge number of distinct users, the hash table might not fit in memory. So what do you do? Well, the first database system you write probably just crashes — it keeps allocating more memory until the operating system, or Kubernetes, kills it. The second thing you do is add a memory limit that tracks how big the hash table is, and when it gets too big, you error the query. That's better. And then the third, more sophisticated thing is to take the hash table and dump its state to disk somehow —
to some persistent storage — maybe several times, depending on how much data is coming in — so you've written a bunch of little files to disk, and then you read them back, merge them, and compute the final result. That process of taking state out of main memory and putting it on some secondary storage, of which you have more, is typically called externalizing.
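The three-stage progression Andrew walks through — crash, then a memory limit, then spilling — can be sketched for grouping like this (a toy illustration, not DataFusion's implementation; real engines partition spilled state by hash so each partition can be merged within the memory budget, rather than naively merging everything at the end):

```python
# Illustrative sketch: hash aggregation with a "memory limit" that
# spills partial state to disk, then merges the spill files.

import json
import os
import tempfile

def external_group_sum(pairs, max_groups_in_memory=2):
    spill_files, table = [], {}
    for key, value in pairs:
        table[key] = table.get(key, 0) + value
        if len(table) > max_groups_in_memory:  # "memory limit" hit:
            fd, path = tempfile.mkstemp()      # dump partial state to disk
            with os.fdopen(fd, "w") as f:
                json.dump(table, f)
            spill_files.append(path)
            table = {}
    result = dict(table)                       # merge spills + remainder
    for path in spill_files:
        with open(path) as f:
            for key, value in json.load(f).items():
                result[key] = result.get(key, 0) + value
        os.remove(path)
    return result

pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5), ("d", 6)]
assert external_group_sum(pairs) == {"a": 4, "b": 7, "c": 4, "d": 6}
```

Stage one (the crash) is what you get with no limit at all; stage two errors the query instead of spilling; this sketch is stage three, where the final answer is the same as the in-memory one, just computed with bounded working state.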

Kostas Pardalis 52:59
Okay. Sounds like spilling data to the hard drive so you don't have to kill the process when it runs out of memory.

Andrew Lamb 53:09
Right. Today, DataFusion does that for sorts and it does it for grouping, but it doesn't do it for joins.

Kostas Pardalis 53:12
Okay, so for everyone out there who would like to contribute…

Andrew Lamb 53:16
Yes — if you want to do it, that would be great. I joke that it's an "afternoon project," but it's a big one. And it still amazes me — working on DataFusion is the first time I've actually helped run an open source project. I've used open source software for a long time, but actually running one of these things — I don't really know why people show up with contributions. Sometimes it's clear: I'm working on this in my job and I need this feature — that's probably pretty common. But a bunch of people just find it interesting. It's one of those rather beautiful things about working on the internet: there are a certain number of people who find database internals interesting, and they show up. I'm pretty sure the first version of the window functions was written by a guy for whom I don't think it had anything to do with his job — I think he just thought it was an interesting challenge. He's a manager at Airbnb or something, not at all obviously using DataFusion; he just basically piloted the first version of window functions. It's still amazing to me. Hopefully he had a good time — I think he had a good time, I had a good time, and we got a feature out of it. It's just a really cool way to learn.

Kostas Pardalis 54:35
That's amazing. All right, we are close to the end here, and I have to give the microphone back to Eric. I think we need multiple more of these recordings, Andrew — there are just so many things to talk about, and it's such a joy to listen to you, so I'm looking forward to doing this again in the future. All yours again, Eric.

Eric Dodds 55:01
I'm interested to know — we've gone into such depth on several subjects here, which has been so wonderful — outside of the world of databases and time series and the subjects we've talked about, in the broader landscape of data tooling, what types of things are most exciting to you?

Andrew Lamb 55:21
I think the broader story is Parquet files on object storage — data lakes, table formats, whatever people eventually decide to call it. I think that is an amazing opportunity, because — it's not going to make things simpler, but what it's going to do is let you build all sorts of new, more specialized tools for whatever your particular workload is. So I think that's very exciting. And — I'm also super biased, right — but I think the idea of DataFusion is super exciting too, because, as Kostas was saying before: previously, if you had an idea because you hated SQL and had a better way to do it, not only did you have to be really good at figuring out what the UX of that new tool should be, you also had to have the ability to do all the low-level database stuff we were just talking about. Some people have that — the guy who wrote Polars, Ritchie, actually I think has it — but it's very rare. So I think something like DataFusion will let people innovate in the query language space, or the UX space, because it basically lowers the cost: you can have a great product without having to invest all this time in the analytic engine. I think that's a super exciting area as well — to see what people build with it.

Eric Dodds 56:35
I agree. It is really neat to see architectures and technologies emerge that encourage all sorts of creativity by unbounding people from what were traditionally pretty severe limitations — because, like you said, it's pretty rare that one person can manage all of those different things to create something fairly novel.

Andrew Lamb 57:00
I think that's the most positive way to look at it. You want another one? The boring economic way to put it is that it's just commoditizing these OLAP engines, right? It's the same thing, but I think that undersells the value, because it's not just that the cost comes down — what it really means is that a whole bunch of things become feasible that weren't previously. I think that's what's happening.

Eric Dodds 57:24
Yeah — the removal of constraints. Well, it's certainly going to be an exciting couple of years. Andrew, this has been an incredible conversation. We've learned a ton, I think the audience has learned a ton, and we'd love to have you back on sometime to cover all the things we didn't get to in this episode.

Andrew Lamb 57:43
That was great. I'd love to. Thank you very much.

Eric Dodds 57:46
We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe to your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.