This week on the Data Stack Show, Eric and Kostas chat with David Kohn, Senior Product Manager – Toolkit at Timescale. During the episode, the group discusses time series data and databases, innovations using PostgreSQL, why Timescale builds their databases differently, and more.
Highlights from this week’s conversation include:
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Eric Dodds 00:00
In the dataset show is brought to you by RudderStack. As we said last week, March is data transformation month. And there’s a competition, if you submit a transformation to our source library, and you get $1,000 cash prize, you can catch all the details on RudderStack. Twitter. Good luck. Welcome back to The Data Stack Show. Today
Eric Dodds 00:26
We’re going to talk to David Kohn from Timescale. Kostas, I love that we have so many different types of databases on the show. We’ve talked about some sort of time series flavored stuff, but I don’t know if we’ve actually talked with a specific time series database technology yet on the show, which is pretty exciting. I think what’s even more exciting is that Timescales built on Postgres, and so we can kind of dig into some of that stuff there. But I’m most excited about getting a 101 on time series data and time series databases. I think so much of that happens in sort of your vanilla, you know, sort of relational warehouse, right? A lot of analysts are sort of building time series type stuff. Which works fine. So why does this you know, why do we have time series specific technology? So that’s what I’m going to ask about you.
Kostas Pardalis 01:29
Yeah, well, so many things to ask actually. First of all, I’d love to learn more about the choice of Postgres versus one of these, like, database systems, that has been the foundation of a lot of innovation, for example, like Redshift was based on Postgres, and MariaDB systems. So I’d love to share, like, what’s so special about Postgres that makes it like a platform for innovation, and other BAE Systems. And why they chose to go after this direction, instead of building like store ads and query engine likes from Facebook ads, like most of their competitors are doing right. So yeah, I think we’re going to have a lot to talk about there. And it’s going to be super interesting.
Eric Dodds 02:22
All right. Well, let’s dig in. David, welcome to The Data Stack Show. So excited to talk about lots of data, things that specifically time series data things.
David Kohn 02:34
Cool. Glad to be here.
Eric Dodds 02:36
Okay. You did not start in computer science, or sort of the world of data in terms of technology, as we talked about on the show. You started in environmental engineering. So can you tell us how you stumbled into the world of databases? I guess it wasn’t a stumble? It was an inventive step.
David Kohn 02:57
Let’s, yeah, so I was working at a little battery manufacturing startup in New York City. Actually, they were like, based out of an old brewery in Harlem. And we actually began to produce some of the batteries there and had a test facility there. It was actually really nice. I was living in New York at the time I could walk to work, it was great. If no one in New York has that. So. But I sort of felt like we needed something that would work better to deal with the battery data that we’d been gathering the test cell data, like, What were these things made of, as well as just the battery cycling data. So we had hundreds of batteries that were being cycled, tested in different ways, sort of trying to do various types of tests to see how they would age, all of that sort of stuff. And I was like, Okay, well, someone needs to work on this. I wasn’t an electric chemist, or I wasn’t an electric MSPA PhD, I should say, which the other electric chemists who were working there were, I’d studied some in undergrad and done some, but you know, I knew enough to be dangerous, but I wasn’t that good at it. But I did have a computer background that they didn’t have a little bit from my environmental engineering days. So I was like, Okay, well, maybe I can help build a database to help us track this better and get people the things that they need to evaluate these cells more quickly. So they’re not spending all their time figuring out where various Excel spreadsheets are, and then trying to write analyses on them. And not doing it. Great. So I started looking into databases. And you know, honestly, like, the first one that I started with was FileMaker. Pro at the, you know, who was using that lab to connect to it because hey, engineer, Ilya, that would be the hard science side of engineering doesn’t necessarily know the good technology. Yep. And we were connecting to an Oracle database that the battery tester ran in its back end, that was okay ish. But, you know, there were problems accessing it. We didn’t have full control. There were all sorts of things we wanted to do, or it was like, Well, we really need our own thing. And FileMaker was pretty quickly not pretty quickly showing itself to not be the right thing.
Eric Dodds 05:07
Can I ask a question about FileMaker? though? Because yeah, sure. Is it? Totally. It is an origin story, though, for many things. Like, it’s interesting how FileMaker like, is in and of itself, sort of like, like first step catalysts almost. Is there anything you remember about? I know there? I know you hit the limitations. But is there anything you remember about FileMaker? Where you’re like, that’s pretty nice. Because like, for what it is, for what it was, I guess it’s probably a better way to say it like, yeah, it was kind of cool stuff. Yeah, I
David Kohn 05:43
I mean, it wasn’t like, I mean, it certainly didn’t scale all that well. But I mean, it sort of does feel a little bit like the gateway drug, to some extent, where it’s like, well, I’m used to working with spreadsheets, but this is like, this can do a bit more. But it also holds your hand a bit more. Yeah. And so I think that, honestly, I felt a little bit the same as mine. So like, the next step was MySQL for a little bit. And then I moved on to Postgres. Let me quickly after that, MySQL has a little bit more of a hand holding, and, but like, Postgres is more powerful in my mind, and has a lot more like constraints and other sorts of stuff that I think are really good. But it’s a little bit less easy to get started, etcetera, etcetera, etcetera. And so I sort of made my way down that pathway, yeah, over like, six 810 months, and then just spent a lot of time reading the Postgres documentation. Yeah. And so like, sort of, I’ve made a bit of a career of being able to read the Postgres documentation better than other people, which is, you know, how it works.
Eric Dodds 06:52
Hey, look, that’s great. It’s free education that drives the career. I love it. Yeah.
David Kohn 07:00
And so I mean, it would I think that I mean, the Postgres documentation is, I mean, if you’re willing to dive into it, it’s pretty technical. You have to, like, really get in there. But it’s really good. It does describe a lot of stuff really well. And like, I mean, but that was also that he had ways. So I quickly got into like, using things in the ways that they weren’t really meant to be maybe that’s my, you know, that’s part of why I like engineering, actually, in general is like, that’s the creative side of it is like, you figure out how to turn this thing and shove it in there. And it works. Yeah. And so I always sort of found myself doing that. And I mean, even to try and deal with time series and Postgres at the beginning, like it wasn’t great at it. And so I started futzing around with arrays and all sorts of other stuff. Oh, wow. Yeah, that Postgres has, and then that actually sort of, and then I went to a Postgres conference. And that led me to my next job in databases, because I started talking to someone about toasts, and arrays, and all sorts of other stuff and what I was thinking about doing there, and he was like, You should come work for us. And that’s how I ended up at Mote, which is an analytics company. They had switched, you know, their back end from MySQL to Postgres. And we had all sorts of other data warehousing type things that they were working with. And there were a few people there who were really into Postgres, and thought I could do more. And the flexibility that Postgres allows in terms of data types, and some of the other interesting stuff that it could do allowed us to build some really interesting stuff. So we ended up sort of taking out Vertica, and Redshift from our back end, at least our customer facing back ends. And replacing them with some, like, a couple of different actually, like Postgres back ends, that were pretty highly tuned for our specific use case, like we had, essentially a data warehouse. But for some reason, our app always wanted all the columns, and we had like 120 columns. And our app always wanted all of them. So you throw out a column store. And also you’re like trying to serve a front end off of this, which means like, a column store is usually made for, like, relatively large columns, columnar scans, and relatively few clients, and now you’re trying to run a front end off of it, or you have a fair number of clients, and you’re always getting all the columns in the column store is just like, Nope, sorry. Yeah, that’s not what I was built for. Oh, that’s kind of what I said. Like when you said before, like, you started to go down this path of like, trying to do time series, ask stuff. It’s like, okay, well, things are gonna get like,
Eric Dodds 09:28
pretty gnarly, like with vanilla, Postgres pretty quickly, but it sounds like you actually. So you start to actually do some cool stuff.
David Kohn 09:36
Yeah. So at Mote basically, I helped build a data warehouse, like a data warehouse replacement, that would that sort of stored all of its columns in a composite type array, and would like gives you all the columns back, but it compressed it a bit, it uses a lot less IO, in order to do it. And, like, we ended up you know, swatted, like getting rid of Vertica MQTT and Redshift and just thinking with that, and another one that we built that was like using Postgres FTW to, and another person mostly worked on this to basically it was basically like Athena Lite, where we actually had compressed sequel lite files and S3. And then we had, like, we would kick off all of these lamda jobs to go and grab sequel lite files and pull them up and run some queries on them. And, I mean, that was where I started using a time scale. So I started using a time scale of, like, 0.50. Wow. As you know, mostly because it was doing partition management. And I had already started doing partitioning and all sorts of stuff to handle that data warehouse that I was talking about.
Eric Dodds 10:42
Mine times, y’all, by the way, because good. Well, it’s okay. So partitioning is like that sort of the like, the entry point where you were like partitioning, and you were trying to figure out a better way to do that. And so then you sort of that was, yeah, so
David Kohn 10:55
I gave a talk, actually a PG con, about the data warehouse that I built. And I met the timescale folks there. But I was like, oh, I should start using this because it’s a pain in the ass to make all of my like, do all of the partitions that up manually, and they were doing it more automatically. And there were a lot of reasons why that was really nice. And there were a few, like key things that really made that really good for us. So I started using it very early. And, you know, we weren’t, like it wasn’t as full featured as it is now. And we weren’t taking advantage of all of it, I’m sure. But like, it was really helpful for what we were trying to do there. So yeah, it was all like, but it was all sort of like the partition management type stuff that we needed to do.
Eric Dodds 11:41
Yeah, totally. And with moat analytics, can you explain like, what were you? Was it a tool where people would like, run their own analytics? Was that a way for, you know, companies who needed to serve analytics to their customers? What type of analytics were Yeah,
David Kohn 11:57
so it’s ad analytics. So yeah, okay, got it. Yeah, advertising analytics. And so there was a whole side that was doing JavaScript and pixels and all of that stuff. And there was a whole real time processing thing that we had. And then what I was mostly looking at was Postgres. Yeah. So once we had some of the like, essentially, the daily aggregates, yep. From that real time pipeline, we were ingesting that into Postgres, for all of our customers across all of the segments and other sorts of slicers and roll ups and slicers. I don’t remember exactly which one. But like, you know, a few different ways that we could slice and dice data, and we had a whole front end based on that. And they were using it for their effort. So that front end was what we were powering, we had a whole thing. Also, that was quite fun, where we actually took a lot of the PHP that was powering that front end. And we took a lot of the calculations out of PHP, and moved them into these entirely ephemeral Postgres instances. So we basically, because PHP would, you know, some of these calculations, it was like, I want all the memory in the universe, and that it would die. Because PHP. And so, we basically said, well, Postgres could do all these calculations really nice. And so we just built this thing where it was like, Okay, we have these, like, they’re basically stateless, they would actually reach out to the other Postgres backends and other types of backends, that we would use the foreign data wrappers. And then they would basically just serve as the computation engine for a back end. And all the PHP was then doing basically was writing SQL queries. That would run it would create some like temporary tables that weren’t exactly temporary tables, but basically temporary tables, put the data in, and then run some queries on that. And then, you know, those would go away after some time. Yeah. And that worked really nicely.
Eric Dodds 13:57
When really, I mean, that’s, that’s really kind of like, headless flavored architecture, which is super interesting. Yeah. Yeah. Yeah. So very cool. Wow, that is fascinating. Okay, so let’s zoom out and just talk about time series for a minute. So you talked about partitioning, right? That’s one that is one component of this, but generally people know and love Postgres, right, you, obviously, you know, you’ve studied the documentation, and built a successful career off of that. And Postgres is awesome. Can you just give us a quick 101 On Time Series databases? And then of course, on the like, we cannot we have to talk about why Postgres actually cost us. You probably need to take to the web, you know, why? Here’s some Postgres which I can’t wait to hear about. Can you just give us the 101? So like time series data, how is it different? Why is it different? Why do you use it, etc?
David Kohn 14:59
Yeah, so I mean, In this, like, so I ended up leaving moat and joining timescale about five years ago, partially because I saw the challenges there, right. So I think, the challenges and also the opportunity, right, so the big thing with time series for me is that you’re usually doing something that traditional databases weren’t really built for. So you’re usually very write heavy, and they’re usually inserts, they’re not updates. So a traditional sort of transactional OLTP type workload is much more oriented towards updates than it is to like lots of inserts, you’re usually doing some sort of repeating value, or you’re tracking states like event data, something like that. So everything is associated with a timestamp. And for us, it’s always when we talk about this, like it’s not, you know, write only or whatever it’s like mostly. So every once in a while, you want to go back and update something. But it’s a more rare occurrence. And so those are two of the defining characteristics, where you have this repeated measurement, something that you’re measuring over time, the classic examples are, of course, things like sensors, any sort of sensor, where you just like, you’re like, Okay, what was the temperature at this time? It was the temperature at this time, it was the time for that matter. That’s, yeah, or the back exactly the battery example where we had temperature sensors to but also voltage, current power, whatever else we’re measuring. But in the observability, use case in, in computer science in I mean, some things that are more typical data warehouse, he type bits, but where you’re really working on mostly, like no recent data, I would say, it would be more time series ish, where I think that maybe distinguishing between data warehouse and time series data is slightly harder, in some ways. I think one of the things that
Eric Dodds 16:53
That’s a great statement.
David Kohn 16:54
Yeah, so I think that often time series and data warehouse type data have a lot in common, right? There’s a lot to do with more columnar oriented data, where you’re doing, you know, with a data warehouse, you’re often not doing updates here instead, like racking up dates, and then treating them as inserts in the data warehouse. So you have a point in time where a value changed. And that’s like, treated as an event essentially. And so it looks a little bit more like a time series. And I think that those lines are blurry. But it’s something that we’re trying to figure out. Yeah, where do we fall in that, but I think in general time series tend to be more read heavy, in relatively recent periods is a lot of the type of workload that you’re doing, we’re gonna have more clients, where it’s like, I have dashboards for all of my clients to see what their sensors were doing. They’re mostly looking at the most recent data, but sometimes they want to go back and dive deep into a certain period. Whereas more data warehousing workloads, I would say, you usually have fewer clients, they have, they want to scan most of the data to do a big machine learning job on basically everything. And that’s the sort of distinction that I would make. Yeah. And so there’s, but there is, I think, a little bit more overlap there. If that makes sense. Yeah.
Eric Dodds 18:15
So when you go back to like, let’s go back to the moat example. Right, where you were, you sort of adopted a timescale for, you know, partitioning purposes. But if you had to start over at the moat, and sort of rebuild everything, how would you build it today?
David Kohn 18:37
I think that some of the choices that we made were actually pretty good for what we were doing. Because a lot of our queries just wanted everything, we actually made some good choices, and some of the things that we would do would actually be somewhat similar, right, for right now, assuming that constraint is still there. I mean, even with our compression, in timescale, like we have implemented compression now on top of postgrads. And that could work nicely, except that it’s still column oriented, to some extent. And if you’re really trying to get everything on every query, like, we still have some overhead to do the projection. Sure. And bring all the rows back. So you know, you’d have to think a little bit about that, you can still use composite types in our compression, there’s various things that you could do. But we had a lot of things where we sort of self implemented something like continuous aggregates that we have. So I would definitely use those things that’d be nice to have different levels of roll ups of this data over time. And that’s one of the things that we do to help make some of those queries that might want to scan backwards in time or do something like that a little more, maybe data warehouse to some extent, where you’re taking something and you want to compare it to a baseline over a year. I want to know This temperature, like how it compares to the other temperatures? Over the course of a year or, you know, the battery voltage at this hour? Like, how do I do that in a reasonably efficient manner without having to rescan all of the individual points? It is a big question that’s really important to try to answer and get the newest aggregates and things like that really do help. Do that. And that’s something that we certainly, like, it was. That was a big problem for us. Like, that was something that like the data scientists science team would sort of do once a month. Yeah, or something. Right. Yeah. And then they published the baselines. Yeah, and like, wasn’t all that flexible, it was hard to do. There were things where it was like, it was just difficult. And so I think that there were things there that we could make fit, you know, with some of the stuff that we have now, we can make it a lot easier. With that said, I mean, there’s also other stuff that I think would be really interesting for some of the real time. stuff. I heard a little bit of some of the stuff you talked about with materialize a little while ago. Yeah. And that sounded interesting to me for some of the stuff where we had a hole in memory, like crazy things for doing the real time stuff. Back there, and that seemed like it could be an interesting type of system.
Eric Dodds 21:19
Yeah, totally. Remember? Yeah, I mean, yeah, Postgres day run, run on Postgres natively. Super goes up. Okay, one more question before I hand it off to cost us. This is so interesting to me to hear about what you were doing with Postgres? Were you sort of looking back on that now? Were you sort of doing some things with Postgres that were maybe more rare? And did you find it? It’s a little bit of a leading question, because I like, probably was, did you find other people in the community who are just Postgres is actually a good way to do some of these things? I mean, of course, he was sort of stretching it to some of his limits without something like a timescale before you, you know, before you adopted that, but what was that like to sort of be breaking new ground? I don’t know the right term to use, but doing some things with Postgres that, you know, we’re not really conventional. Yeah, so I think,
Kostas Pardalis 22:21
I mean, one of the things
David Kohn 22:23
What we were doing was packing rows into arrays. And this goes back to sort of a time series type stuff where, like, you’re often not, in the same way, as in an all TP type workload, dealing with individual rows, you’re more often dealing with sets of rows that you want to aggregate somehow, in order to make them more meaningful. That’s definitely a characteristic of time series data, I’d say. Like, at some point, when you have billions of rows, it’s just like you have to, in order to make sense of them, there’s like, there’s just no choice. So one of the things like we were using toast, which is my favorite acronym, and Postgres, at least, and maybe in this role would be oversized attribute storage technique, which is basically what Postgres does when you can’t fit a row onto a page, it takes it in, chunks it up and it stores it in another table. does that for you automatically. You don’t have to do anything. It just yeah, it says, I can’t fit this on a page, I have to start somewhere else. So we were using that essentially, to our advantage to say, Okay, well, we’re usually going to access these rows together, I don’t want them randomly distributed, I don’t want to have the overhead of Postgres is row overhead and some of the indexing stuff, I actually want to be able to index to a group of rows. So we’re storing an array and letting toast do its thing. And like, then take that out. And like that was a thing that was not at all, like Postgres was just trying to figure out how do we deal with this problem where like, sometimes rows are too big for a page, right? That’s not what they were trying to build. But it turns out, it solved our problem really nicely. And some of the folks in the community were like, okay, cool. That’s great. Yeah, right. And it’s actually some of the basis, I mean, we have more advanced techniques for how we compress the data and store it out, but it’s still using the toast mechanism to store it while flying. So like, in timescale, we use a modified version that’s more performant and written in C rather than Sherif, el PGSQL, and all sorts of nice stuff. But in the end, like, that fundamental thing that Postgres built, which is like, how do I store this, like a larger block of data out of line of the main table? And this sort of insight of, okay, I want to take that I want to have it correspond to a set of something rather than to an individual row. That’s sort of the fundamentals still the fundamental basis for some of what we’re doing there. And it also means that the main table is very small, right? So if I store a little bit of metadata about what’s in that row, then I can Avoid going and looking in toast for that section. And I basically implemented a skip list that’s common in a columnar database, but I’ve done it inside of postgrads lickety split. Right. And so, you know, that’s the sort of stuff that we were doing a little bit of it mode, and we’ve done a lot more of a timescale.
Kostas Pardalis 25:19
Yeah. And
David Kohn 25:23
he, it’s just taking what was built for something else and repurposing it a little bit and flipping it on its head. And it’s one of the things that I really do like about Postgres is that they have all these extension hooks and other sorts of things. I mean, even just the way that it was originally built to accept different data types. So you can define your own. So that was more extensible. Like that’s been built into Postgres from its beginnings. And so, yes, it’s something new. But it’s also part of what, like, and it’s a slightly different way, I think we’re taking it a little further than maybe most people have. Yeah, but it is something that was imagined from the beginning of this project. Yeah, 30 years ago. Yeah, they wanted it to be more extensible, wanted it to be something that was more powerful, that you could build your own thing on top of it and customize it more, and it would be more flexible to work with different kinds of data. And so that has been there that has been a through line of what Postgres has done for its entire history from, you know, the days back at Berkeley, I guess.
Eric Dodds 26:26
Yeah. So I love it. I love it. No. Spoken like someone who has been a diligent student of the Postgres docs. No, no, super helpful. Okay, Costas. I’ve been monopolizing you so please, please jump in. I could keep going. But you’re probably gonna take a couple of questions out of the tip of my tongue.
Kostas Pardalis 26:48
Yeah. Thank you, Eric. So David, I was listening to what you were saying about building a time series database, right on top of Postgres. And what I find, like, super interesting here is that there’s like, it almost sounds a little bit contradictory in a way, because we make the claim that like, Okay, time series data, like special enough to require their own, let’s say, way of working with it. But at the same time, like, you decide to use, like a technology that initially was not built for that, to do that, to do it, right. And I’d love to share like, the wide fares, and then we can talk about the how, but why not go and do what’s what’s a out of like all the rest of the type of the time series database, she’s out there have done we choose to create their own DBMS system, right? And go and do it on top of Postgres, like why did you choose that?
David Kohn 27:54
Yeah. So I thought about a few things, knowing a little bit about so I joined a timescale number of about 11 people. And it had been like, so it was, it had been something else before that. And timescale actually came out of like, one of our sort of very early employees, trying like saying, Maybe I should still do some Postgres because we tried some time series databases, and they weren’t, like really meeting our needs very well. And so like, that was, the early history of timescale was like, this isn’t working, why isn’t it working? Well, I actually want some of the things that come with, like the standard database approach, like I want to be able to use SQL, like, it turns out, it’s pretty powerful. There’s a lot of stuff it can do. And I kind of know how to use it. I want to be able to do some of the things like join a reference table that has the information about the sensors that I’m using. Right, I want to just have that there. So like, some of what we wanted to do is that your OLTP workloads and all of the information about what the time series are. If that lives somewhere else, then it’s really it’s actually it actually becomes really hard to use your time series database effectively. Because you can’t do something like a joint you can’t ask the questions about all of this other information that’s out there very easily. You have to then essentially write your own application to do that. So those two things living side by side in some way or another was one of the big reasons to do that. There’s also all sorts of stuff around animosity, consistency, the ability, you know, to, it turns out, like when you write your own engine, you have to make some hard choices. And like often in time series use cases it’s like, well, you can’t actually go back and rewrite history. Like we can’t do out of order inserts. And like in time series use cases. Out of order inserts are rail. Like they’re free. went compared to the overall number of inserts, usually, but they can still be really important. Yeah. Right. So there were some trade offs that we were making that were like, well, we get this for free, right? We get the drivers for free, which I think is, you know, can be good and bad. To some extent, I think that there are trade-offs in all of these approaches. And like, there are times when Postgres has held us back to some extent, right, like, it is a bear. Yeah, work with a 30 year old, see codebase, with however, many million lines of code in it, and figure out all of the interactions and the weird shit that you have to deal with. But it also means you get 30 years of them seeing all of these use cases in the wild. And where that weird shit, like, there’s a reason for it, someone needs to do that. And so it gets you somewhere in terms of like, what the real world use case for a lot of this stuff is and figuring out where you can deviate from that. And where you shouldn’t, has, I think, been one of the big challenges for us. But I think there’s a lot of value there. It also means that you’re less likely to lose data, if you like, which is important. Not always the most important thing. And sometimes series use cases where it’s like a duplicate. It’s a duplicate store, but like, you already have various knobs that you can twiddle in Postgres, where you can say, well, I’m okay, maybe if I lose a little bit of data here, like there are things that you could do. But I think having some of that consistency built in all of that stuff, really helps you get started and build something out. That is easier to work with. So I don’t know, I don’t know, if I actually answered the question there, if that was helpful, but yeah,
Kostas Pardalis 31:51
oh, it’s 100%. I mean, I think the history of database systems is like a history of three dogs in general, like, I mean, you take it like from an abstract point of view, like, everything, like you can define like, the law bases, and then that is universal, right? At the end. It’s all about storing and managing state. I mean, the user likes to perform some computation on top of that, right? The thing started, like getting interesting when we need to start making trade offs. And these trade those words, is causing, like all these flavors. And of course, like all the opportunities to build products and businesses on top of that, right, like, so. There’s definitely, I think, if there’s something that has been proven, is that there’s no one database system that can, you know, manage all workloads at all times. So, but a follow up question to that. Okay, you answered like why. But let’s talk about the top right. How do you take something as complex as being like a database system that has been built for, like, 30 years? More than more than that? And you build on top of it, like completely different, completely? Different database system? Right, like, what are what I would like to hear?
David Kohn 33:13
I think it’s both, like, from what, let’s say, timescale has done to make this happen. But also words, Postgres offers the allowance. Timescape to do that, right. Yeah. So I mean, so there are a few things that I think are key. And Postgres has had an extension system for a long time. And that extension system has sort of, I’d say, sort of, two different sorts of interfaces that the extensions can use. So there’s one set of interfaces that is basically entirely in SQL. And it’s basically like being able to create types, being able to create functions, being able to create aggregates, being able to create operators. And actually, the part of the product at timescale that I work on, the Toolkit team, is written in rust, and it mostly interfaces with Postgres at that level. So types, functions, operators, all mostly in SQL, very little work behind the scenes on the C code. And then there’s another set of hooks that Postgres provides that are essentially places in the actual C code, where it allows an extension to take over control, and provides an API inside the C layer, where you can sort of take control and do something different. And the main times you’ll be extended work more heavily on that side. And so that’s a it’s, and that’s, you know, that’s that is that can be hard. There were some clear things that we wanted to do to improve the partition management stuff that Postgres has, like, gotten much better at over time. So when we started, it was on nine, six and declarative partitioning wasn’t even really a thing. Postgres. So we’re still working with the old inheritance based partitioning system. And over time, Postgres has gotten a lot better there. We’ve also gotten a lot better, and some of the hooks are a little easier to work with. But then there’s all this stuff around compression and continuous aggregates and the other sorts of hooks that we need to look into to then take over processing and certain types of just optimizations that we do, that are more specific to time series data, the way that we order the partitioning, and can do various types of constraint exclusion, or partition exclusion inside the planner. And inside the optimizer sometimes at execution time. We’re working on some things around aggregation and making that faster at execution time. So like, there’s various things that you can do in various ways that you can hook into Postgres. And it actually offers a lot of these hooks that I think are really helpful for both. Like, I think that we helped also drive the community forward. Like, in some of the senses, and it’s a great way for the Postgres community to have innovation happen. And let them sort of at least see that there’s something going on here that they need to respond to. But don’t, they are often a little bit slower, just because they’re a huge project that has 30 years of history and all that stuff. They release a new version once a year with a new version. And like, some things that if you want to contribute back upstream, you might not see it for two or three years, which is a long time. Yeah. And so I think that, you know, we now have a team that’s contributing back upstream more, which I think is great, it’s really good for us to be sort of giving back in that way. But, you know, you see that, it just, it takes more time. And there’s, but, of course, there’s a lot of benefit to that for us too, because, like there are a lot of a lot more people working on Postgres that are working on timescale. Still, right, like, that’s gonna be the way it is for a while probably. And so we get the benefit of their, of all of that work. Because we’re an extension, it takes some time, also for us to make sure that we’re compatible with the next version and all that stuff. Right. But that’s so those are the trade offs that we see. We very much wanted to be an extension, not a fork. There have been a lot of successful projects. I think Vertica is one of them Redshift and other Postgres in the backup past, of course. But we didn’t want to go down that route, we wanted to make sure that we stayed an extension so that we can keep benefiting from the community as well as giving back and feeling like we had a you know, because a lot of the projects started uncorking at some point, they’re like, Oh, my God, all these things that we want, have gotten into main, mainstream Postgres, and like, now, we’re screwed. Because we’re a fork.
Kostas Pardalis 37:56
So does it make sense? Okay, you mentioned like, they’re like two, extension to extension systems, like, in a way and you’re working like one of those has to do with extending the seat? Well, it’s you also mentioned that, like, your rights in Rust there, so can you like, get a little bit more into like, the details for that? Like, first of all, why do you need, like, languages outside of a sequel to do that? Right? What’s the reason for that? And how does it work with something like rust? Because I would assume like, you know, the system was built a lot earlier than rust itself. Right? So how does this work?
David Kohn 38:38
Yeah. So Postgres can work with lots of different languages, and it can sort of hand control over to them. As long as you sort of stay within some of its memory allocation type stuff, it works fine. And it even has some languages that you can run from within the database and other ways. But wait, so we actually work with an extension framework called PG x in Rust, that allows you to write extensions in Rust that then hook back into Postgres through this sort of sequel interface. Most of why we’re writing in something like Rust is because we are writing new types of, say aggregates, or functions that we want to have a bit more control than SQL, like you could, I mean, SQL is Turing complete. So it’s possible. But like, it’s not necessarily efficient. To write it that way. We want to have more control over what’s going on in the innards. Whether that’s and so we’ve worked with so so a lot of the stuff that we do on the Toolkit team is like we write some things around like data sketches, like hyper DBLog and some of the frequency type accounting count min sketch. We do things with things like counters and, and other sorts of things that just make analysis easier. So when we write we like when we read Custom aggregate or custom type, we’re giving something that then you can use inside of SQL. And it makes the interaction with SQL and with the rest of the database easier somehow to solve a particular problem. That’s really what we’re trying to do there. So why did we choose rust? Essentially, because it has some nice qualities around memory safety and, and other bits that we thought would be good to take advantage of, as well as some nice libraries and stuff that we’ve used for some of these to solve some of these problems. Yeah,
Kostas Pardalis 40:35
yeah. 100%. So okay, I think getting an idea of like, one part of what, like, timescale DB is like adding on top of Postgres. What other secret sauce is in there? Right? Like, what are the contracts that timescale? Introduces that they are not native to Postgres, right? Yeah. So.
David Kohn 40:58
So one construct I talked about a bit, which is the concept of a hyper table, which actually comes from the way that we do some of the partition management. So it’s a way of treating a set of partitions more like a normal table. The partitions are created automatically as data comes in. And so that’s one thing that we add on and the name actually comes from the concept of a hypercube, and multi multi multi dimensional spaces. And how that hypercube sort of gets segmented is that an individual partition has some disjoint set of constraints around it about what data lives in that partition. So that’s one thing that we build on top. Another thing that we build on top is something like compression, which really reduces the size of the time series dataset, it takes advantage of some of the qualities of time series data, reduces the size and can actually increase performance on a number of types of queries, especially where you’re IO bound, which is often the case. So you can compress a hyper table by adding a compression policy, you can also add things like continuous aggregates, which are essentially materialized views that have some sort of aggregation involved, like we said, an aggregate is something that we often do with time series data, we want to have some summary statistics of some sort that we’re adding on there that allow you to better understand what’s going on. So continuous aggregates are basically an incrementally updated materialized view that also keeps track of any inserts to the underlying hyper table, and will go back and do essentially eventually consistent updates to that, to make sure that it’s, it’s going to give you the right state. Yeah. So far out of order inserts or updates, it’ll deal with those as well as doing real time. So basically, the way that the continuous area works is you have some of it materialized and some of it the query is happening in real time. So the most recent data is the queries may happen in real time. And those get put together in a nice way. And then something like the hyper functions that we work on, offer specific time series types of functionality, that then interact with these other things in a nice way. So for instance, one function that we’ve been working on that we’ve released a while ago, it’s actually an aggregate called counter Ag, and it’s about dealing with resetting counters. Now, where does that come up, it might come up in an observability setting, where you’re tracking the number of visitors to a website, and sometimes that counter sort of rolls over, it might come up in an electricity meter, setting where your utility, you have all these electricity meters. Some of them might reset every once in a while. But you’re also trying to figure out how you calculate the actual energy used in a period, when what you’re sending is the total amount of energy used overall time essentially. And like, that’s not necessarily the easiest thing to deal with, and SQL. So that’s the type of stuff that we write a function in Rust, with the right types of stuff and interactions with the Postgres aggregate system, which has its own whole set of things that it does, right. So around parallel workers and around all sorts of other fun things that Postgres has, it has a whole aggregate framework, where it allows you to create your own aggregates and deal with parallelism and all sorts of other fun stuff. But so we make that we make something that’s easier to deal with, that gives you a way to interact with it in a nice way, and define that query in a way that is more understandable, more repeatable, and solves for the edge cases, that’s like, Okay, I have some resets of the energy meter, I don’t want to have negative 10 million kilowatt hours used when my meter rolls over for some reason. That’s a bad thing. My billing system is going to be completely broken. We also, you know, we interpolate to the edges of buckets so that you can say during this 15 minute period, even though I’m measuring every two minutes, and that every two minutes isn’t exact right? I want to interpolate between the points that I have so that I get the usage that was in that 15, that I can really attribute to that 15 minute period. So that all happens inside these aggregates in a way, that then means you don’t have to do it with like, six different window functions and four sub queries. And you know, you can at least hopefully make, you know, keep it down to two. And then we interact with things like gap filling. So what do I do when I’m missing buckets? I want data every 15 minutes, but sometimes the network is down and like, what happens there? How do I fill in for that period that I missed? So those are the sorts of problems that we help people who are working with time series, to figure out how to do and work within that sequel structure to even extend a little bit what this index can do, I mean, there’s only so much in terms of extending the syntax that we can do, because like, we can’t modify the parser. That is a limitation of the Postgres extension system. It’s also like, at what point are you no longer SQL? Yeah. But like, there is a lot of stuff that we can do even just with types and other sorts of stuff. So we’re even talking about like, how do I give people access to for instance, a data frame like syntax, that I could write in SQL, and have the database figure out what to do and like, actually still be SQL that I’ve written, but it’s actually now going to be implemented almost in a data frame, so that people who are more familiar with that have a way to get started or have a way to, to do some of these calculations that otherwise would be really hard. So that’s another thing we’re considering and thinking about how we can not quite get there yet. Yeah.
Kostas Pardalis 46:44
So yeah, that’s a great point. Like i actually like talking and like, I keep thinking about, let’s say, the interface with a user and like the experience of the user has, right. And you said many times, like, again, that’s something I find super interesting, like, secretly something everybody knows, like, or can start working with. But then you hit some limitations on how you express things and things start getting like ugly, and suddenly, you need to add stuff there. So I’d like to hear from you. Because you’re like one of them there aren’t that many people like out there to be honest, that they are working in trying to build experiences around things like a data system and databases. And I’d love to hear from you like how you do that. Like, I think there’s a lot of information and knowledge out there about user experience, like how to design for, you know, for an optimal experience, like on a web application. But here, we’re talking about data systems. And actually, it’s not even the system actually the data, right? It comes as an interface to the data like this, the user of the end doesn’t really care, like what the engineer is, right? Like, it cares about the data. So how do you think about that, and how do you design and build experiences, like in a system like timescale dB?
David Kohn 48:19
Yeah. The answer is, it’s, it’s tough. You know, that’s to beat traders up, like, but yeah, I mean, so it’s, it is a difficult thing, I think, you figure out how you make something simple enough that someone can use it, while also powerful enough to really make it easier to work with. And thinking about how you can save of that distinction and what you’re going to leave to the user, what you’re going to have the database do what you actually want to do somewhere, maybe outside of the database, like in a driver or something like that, that’s something that we’re beginning to think more about, where you’re actually going to like, which parts of the system should be optimized for ease of use versus versus which parts of the system should be optimized for flexibility. And, sort of power, expressiveness. So then you can build something on top of it. That’s going to be more performant. Or be easier to use or whatever else. Right. So it’s, I think that there’s a balance that we’re, you know, we’re still trying to figure out one of the things that we do on hyper functions is that we usually release them experimentally, before we release them fully, so that we can get some feedback from someone, often they’re coming from some sort of user request, like I have this problem. How do I solve it? Well, here’s this horrible SQL query and it’s not very performant and whatever else, okay, how can we make this better? better. So like we can we and we give them we were like, Okay, here’s an experimental version, try it out. And we try to get some feedback and see, okay, were those the right parameters and some of the things that are harder to change, we try and iterate through one or two releases, and then actually release properly. So that’s one thing that we’ve tried to do and has helped, I think, we’ve definitely gotten feedback from early users and then gotten more users as we release it more fully. But people who have those problems who have written that SQL query, they’re often willing to try something else, because they’ve seen that it’s not great. Right? And then I think it really does help to have a pretty deep understanding of what’s going on in the database back end. So that you know a bit more, where are the leverage points that you can hit that really are going to impact performance. And this is one of the things that I think, is still a struggle for us, I think, you know, it’s something that we’re trying to figure out if it’s the right approach is like, sometimes people want their SQL to just be generic and work on any SQL database, but like, you’re going to get a lot better performance. In many cases, if you learn, take the time to learn the tool that you’re working with. And write either using some, if you’re working with timescale, like there’s all of these functions that we have that really make that analysis easier if you can write less generic sequels, but like something that works with the stuff that we’re good at. Yep. Because we give that to you. Means you’re less portable. But honestly, who changes their database? Like once you’re in production? Like, that’s hard no matter what. Yeah. So like, I think sometimes people have gotten obsessed with that, in a way, that’s bad. And so that’s one thing that I think like, as an agenda for me is like, learn new tools, write stuff in the way that takes advantage of the things that they do well, and you’ll be much happier with the results. So but also, we’re trying to figure out how we reach people better and how we can interact with them more and have more of the types of documentation that reaches them where they need. I mean, one of the things that we’re finding is something like this counteracting right there. There are, and the two use cases that I mentioned, right, the electricity meter use case, and the website observability type of use case, they have the same sort of problem. And we have some functions that are going to really make that much easier to work with. But if I’m the electricity meter person, I don’t use the same words, to refer to the problem as the person who’s on the observability, like, web side of things, no, like, you’re going to talk about a counter. That’s like, depending on who you are, it’s like a counter that’s like measuring web traffic or measuring network traffic. These are counters or something like that. In the electricity sector, it’s meters. And the problem of resets is a little more rare. It’s a little, but it’s really important that some of the terms and the problems that they’re trying to solve are just subtly different. That makes it hard for people to reach it. So I think we’re, you know, the strategy that we’re trying out now, and we’ll see if it works is like we’re going to write more shorter contents focused on specific sectors that just like our problem, solution, problem, solution, Problem Solution, give people more like recipes. So it’s like, if I have this problem I can find an answer to it is one of the things that we’re trying to figure out and like, that almost becomes UI. Yeah. Right. Like documentation, and examples and articles. That’s your UI as much as anything else. Yeah, yep. Right. And that’s, it’s really, it is a different thing. It’s hard to figure out how to do it. I’m not sure that we’re doing it right. We’re trying. If you have feedback, if you have thoughts, please let us know. So yeah, that
Kostas Pardalis 54:31
It was fun. All right. So one last question from me, and then I’ll give the mic back to Eric. So how important do you think that the user of developer experience is for the success of a system wide timescale to be for a database system?
David Kohn 54:48
I think it’s pretty darn important. Because, I mean, I think that in many companies now the developers are And the users drive adoption as much as especially around databases. They drive adoption as much as anything as much as anyone else. Yep. Performance matters a lot. But there’s also like, a lot of things aren’t good enough. Yeah, a lot of use cases. And so then what you’re trying to figure out is like, am I being productive? Yeah. Can I write? Like, is this saving me time? I can write less application code because I can do more in the database. That’s really valuable. Sometimes that also massively improves performance. Like if I don’t have to send data over the wire. And I can, I mean, this is like some of the stuff that we’ve built around the largest triangle, three buckets, I don’t know if you know this. It’s a visualization algorithm that does down sampling of data, essentially, for graphing to maintain the shape of a curve, you can do it on the database side. Like, it’s actually that I can hand it off, and the client doesn’t have to do as much. So there’s things that you can do to make it easier for something to work in the database that then really improves performance down the rest of the stack. And so I think, figuring out how to do that is something that is a challenge for database folks. And something that I think if we can get right can be a real competitive advantage. It’s like how do I make this UI easy enough for people to use but also powerful enough that people want to use it for more things, and move more calculations closer to the data? Because that’s going to have compounding effects, in terms of performance, and in terms of all the other stuff that they need to do with the rest of their time?
Kostas Pardalis 56:42
Makes total sense? All right. The stage is yours. Again,
Eric Dodds 56:47
I have one more question. I love that Brooks isn’t here and we can push it a little. So great. Okay, you know, people who are technical and like solving problems, like to sort of push their tools to the limit, right. And you exhibit a bit of that with Postgres, you know, sort of pre timescale, which I love. How does someone know when they need to implement time series specific functionality? Because there are certainly companies who can probably do some things at a smaller scale than you were facing a moat with Postgres and using some of those approaches. And that’s fine. We’re a data warehouse, right. And that’s fine. But like you said before, it’s, you know, you can do a lot of things that eventually don’t end up scaling, because you don’t know that a time series solution exists. So if you know, when do you need to actually adopt a time series database? Yeah.
David Kohn 57:51
I mean, I think there’s a couple different reasons that I would think about, or things that I might see that I, that would help me think that it’s time. Actually, let’s talk about two different things here. So one is the UI. Yep. Right. So. So if you’re having trouble doing the thing that you want to do in bog standard sequel, like if you’re finding that you’re writing this thing, and it’s like, the sequel just seems excessively hard. There might be a reason to just adopt something like a timescale that makes that query easier for you to write. Like, there, right, that might be it, right? Because we’re gonna make that query easier, more performant, when you’re seeing things like, I have materialized views that are taking a long time to update because I need to cache some of this stuff, because I want to serve my UI better. And I’m doing aggregates in them. Like, that’s a great reason. And that’s more somewhere between scale. Right? And user interface, right? Yeah. help tweak that interface? Yeah, in a way, that’s like, Okay, that helps me solve this problem. I need to not have my materialized view, recalculate everything from the beginning of time. Yep. Every time I refresh it. That seems silly. And it’s taking a long time. I’m like, that’s another good reason. And then, if you’re seeing really high ingest rates, where the ingest is slowing down, if you’re seeing really high disk usage, I mean, one of the things that ends up holding a lot of systems with time series back is just that disk usage. It’s why compression is so important for us, like we’re seeing, in some cases 10 or 20, sometimes even a bit more times less space use. So you’re thinking you’re now fitting your data into 10% of the space that has an impact on the rest of your system, as well in terms of where your IO bandwidth can go and what can be used. Usually database systems aren’t CPU bound. There you Usually IO bound. And if you’re finding that, like your OLTP type workloads are struggling because the of the IO boundedness of the stuff that you’re writing for logs, and you know, all of the time series analytics data that you’re writing, they’re like, it might be time to think about it. And it might be okay to just move, just move the time series part over to start and maybe connect back to it FTW to the OLTP data that you have in the other database, maybe that’s where you start. And maybe eventually you have them all side by side and it’s nice there. Or maybe you keep it that way. And that’s fine, too. So they are different, those are some of the things that I would start seeing, when you start seeing weird locking issues. I mean, a lot of the stuff that we’ve had to deal with is like, how do we do this in as lock free away as we can, how do we do this operation, so that we’re not taking a random lock on this part that that is going to mean that my insert volume has to drop, because I have this lock that sequenced into my scheduling in a way that like, Oh, this is weird, what’s going on? Yep. So there’s a lot of stuff around that we try to manage and help make easier. I think also even just like dropping data, if you’re doing things like, like, it’s really simple, in some ways, like we have these functions called Drop chunks, where it’s like chunks are the part of the hyper data partitions of a hyper table. That’s how we call them where it’s like, I just want to drop data that’s older than two years. In Postgres, let’s delete. And then you need to vacuum. And you might not reclaim space, and it starts slowing down inserts, and there’s like all this stuff. So as you start getting into those types of problems, those are some of the things that people like you do want to start thinking a little bit ahead on. You know, when you have those data retention type policies, that’s another thing that I think it really helps to have some time series database to work on. It’s like, I want it like Postgres doesn’t think about it in the same way. Because it’s like, that’s not the operation that you would normally do in an OLTP type context. But it is really important in a time series context. Some point, things need to fall off the back. Yeah, so those types of things. There’s just like, this is a more time specific thing. If you have time data, and it’s like, getting bigger, you’re trying to do these more time oriented operations. We’ve thought about those things in a way that Postgres, which is much more general purpose is just like, they’re not going to do as much of that’s just how it works. Yep. So,
Eric Dodds 1:02:30
Yeah, I love it. No, I think that’s, uh, we had a guest recently say, new, we think about sort of the advancement of technology, like, what does that mean, people are worse at managing databases, you know? It’s like, well, no, it just means that they can focus on problems that are hard for, as opposed to, you know, trying to solve difficult problems with an instrument that’s not quite, you know, built for that specific purpose. Yeah.
David Kohn 1:02:59
And more specialized, right. So the goal is that you can solve the problems that are specific to your business. And that you really need to solve, right? You know more about your customers, and we’re ever going to, we want to give you the tools to help you solve those problems in a simple way, so that you’re focused on that, though, like, not on this weird, like, how do I drop these things and then have to vacuum and figure it like, that’s not your customers don’t care about that. But they see the results if you don’t do it, right. Yep. So
Eric Dodds 1:03:35
yep, I love it. I love it. Okay. So listeners who are running Postgres and want to dip their toe in the water here, where did they go? What did they do?
David Kohn 1:03:49
Yeah, timescale.com is a great place to start. Even if you’re not running Postgres, I’d say that there’s actually a fair number of MySQL, we see a fair number of people on MySQL and things like that, where they’re like, Oh, my sequel? Yeah, totally. And partitions in MySQL are like, they get weird. Sometimes they’re not managed transactionally so bad things can happen. So there’s, you know, migrating from Postgres, from MySQL to Postgres, a little hard, but not crazy, usually. So definitely timescale.com We have a cloud offering that you should definitely take a look at, especially if you’re not into running Postgres yourself. And we have blogs and Docs, and I have a YouTube series that you can look at that gives you an introduction to some of this stuff. I think it’s like the foundations of Postgres and timescale. It’s on our timescale YouTube channel. Yeah, so those are some of the places you can look to get started. But I’d say you know, give it a try. We have a free trial period as almost everyone seems to these days. Yeah, get a shot and see if it helps you. So,
Eric Dodds 1:04:58
I love it. David, thank you so much for the time today. I mean, so many learnings, actually fun to just dig into some Postgres guts. to So thank you so much. And we’d love to have a future show sometime in the future.
David Kohn 1:05:18
It’s really fun. Good to meet you guys.
Eric Dodds 1:05:20
We just chatted with David cone from Timescale DB fascinating technology. I think one of the big takeaways that I had was how extensible really good technology is, you know, he talked about Postgres sort of originating 30 years ago, right. You know, we sort of tend to talk about the last decade of change in data, right? And how much has happened, right, and data warehouses and data lakes and all that sort of stuff. Right, we’re going back three decades, it was just amazing to hear David talk about how many of the things that make timescale possible in terms of dealing with time series data were conceived of as sort of functionality at a base level and Postgres 30 years ago. Yeah. Which is really cool. And I also think it just speaks to not only David’s ability in terms of how far he pushed it, which, you know, like, it’s amazing how far he pushed Postgres you know, without additional technology on top of it. But also, just to the wisdom of the people who architected the original system, so that’s what I’m gonna take. I mean, it really is, it is pretty incredible.
Kostas Pardalis 1:06:49
Yeah, yep. 100%. I mean, there were like, so many lessons learned from like, this conversation with David, like, first of all, it was amazing to hear about, like, all the different, like, ways that you can work with something with Postgres, and like, the tools that it provides to extend and build on top of that, to the point where like, you pretty much end up like building almost like a completely different databases. At the end, that was super, super interesting and gave me a lot of food for thought and I like to go and take a few things about Postgres. But well, they also enjoy it, because it’s like one of these, like, rare occasions where we have someone who is really deep into database technology, but at the same time, like the product side of things. We’d like the conversation around how important the developer experience is, for delivery in terms of their success, and how it becomes more and more important, the end, like these days, and he said something like, I don’t know, like, it was like, super, super interesting to hear that. Like, what performance is like, good enough. And when performance is good enough, the question turns into, like, how I can become more productive. And that’s where the experience becomes super important. Yep. So I don’t know, that’s probably like, just like, let’s say part of like, the conversations that we’ve had, but I think it’s like a great indication of like, the whole conversation we’ve had, and hopefully we will have you back soon again.
Eric Dodds 1:08:35
I agree. All right. Well, thank you for listening. Subscribe if you haven’t told a friend and we will catch you on the next one.We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.
Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
To keep up to date with our future episodes, subscribe to our podcast on Apple, Spotify, Google, or the player of your choice.
Get a monthly newsletter from The Data Stack Show team with a TL;DR of the previous month’s shows, a sneak peak at upcoming episodes, and curated links from Eric, John, & show guests. Follow on our Substack below.