This week on The Data Stack Show, Eric and Kostas chat with Robert Hodges, the CEO of Altinity. During the episode, Robert discusses ClickHouse, real-time analytics, the future of database technology, and more.
Highlights from this week’s conversation include:
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Eric Dodds 0:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com. And don’t forget, we’re hiring for all sorts of roles.
We’re gonna talk with Robert from Altinity today. Super interesting guy. He’s worked with databases for almost three decades now. Three decades, that’s 30 years.
Kostas Pardalis 0:44
More than that?
Eric Dodds 0:45
Probably more than that. Oh yeah, four decades. You're right. Wow, I'm bad at math. So, super interesting guy. He works on ClickHouse and builds services on top of ClickHouse, which is super interesting. Here's what I'm going to ask him: four decades in databases is a long time. A lot of the people I've talked to who worked in databases early in their career eventually moved on to do something else, and I want to know why he stuck with it. I mean, look, databases are great, but I've seen a pattern where you work on certain databases for a while and then go do something else. Staying there for decades, that's some staying power. So that's what I'm going to ask him about. What about you?
Kostas Pardalis 1:26
I want to learn more about ClickHouse, and I think we have the right person to teach us about both the technology, what makes ClickHouse such a special database, and the use cases, because other technologies have appeared in the same space. It would be great to understand a little better why we need this new category of database systems and how they are used.
Eric Dodds 1:57
Let’s dig in and hear from Robert.
Robert, welcome to The Data Stack Show. Thanks for giving us some of your time.
Robert Hodges 2:04
Eric, it’s great to be here with you and Kostas.
Eric Dodds 2:07
Awesome. Okay. Give us your background. So you’ve been working in data and databases specifically for quite a while. So give us just a quick rundown and then what led you to Altinity today.
Robert Hodges 2:20
Sure. I started with databases in 1983. I was actually serving in the US military as a programmer in Germany, and it just happened one day that the unit I worked for bought this database called Model 204. It's a pre-relational database. They said, okay, learn about this thing and write some apps for it. So I started doing that, and it was the most interesting software I had ever worked with, because it had a language, it had APIs that dealt with data, and there were all these interesting ways you could organize things to make them run faster or slower. As I came to learn more about how it worked inside, it was just a fascinating piece of software. So by the time I got out of the military, there was nothing I wanted to do more than go work for the company that designed this database. As it happened, I ended up going back to school at the University of Washington, so I didn't go there. But I then ended up down in the Bay Area and worked for Sybase for seven years, and that got me completely hooked. I'm not a CS person; I actually grew up studying things like economics, Latin, and Japanese. But working at Sybase was like doing a master's in CS. There was just so much technology, and so many great people to learn from. From then on, pretty much continuously, I've worked on databases, and the thing that has kept me hooked on them is that everything interesting in CS shows up in databases sooner or later. So I worked at Sybase, then worked at a couple of other startups building apps on top of databases. I ran a company that did database clustering for MySQL; we sold that to VMware, and I went and lived there for a while. What drew me to this current job is that one of my best friends is a guy called Alexander Zaitsev, who's CTO of this company called Altinity. They were doing enterprise support for ClickHouse, this new data warehouse, and while I was working at VMware he kept telling me, hey Robert, I know you like databases, you should come check this out. Eventually I did, and it was so interesting that I thought, okay, I've done interesting things at VMware, but I'm going to get back into the startup world and try this out. That's how I landed at Altinity.
Eric Dodds 4:43
Very cool. We want to hear a lot about ClickHouse and Altinity, but a couple of questions for you first. You didn't study computer science in school, but you did study languages, and I'm just interested to know how your study of languages influenced your understanding of, or work with, databases. Was there a relationship there? Do you feel like it was helpful?
Robert Hodges 5:10
It was pretty tenuous. I'll tell you, first of all, how languages helped me: they got me my first industry job. The reason was that I have a master's degree in Japanese Studies, and when I got done with that I was actually a certified translator. There's an American Translators Association, and I was a certified technical translator of Japanese to English, so I could read things like Nikkei Electronics in Japanese. Sybase was entering the Japanese market and needed somebody to test their software, and you have to be able to read Japanese, because if you screw up the data, what happens is your kanji become corrupted. And it isn't as if you see it obviously; it just morphs ordinary kanji into characters that haven't been used since the Middle Ages. So they needed somebody who could read Japanese. That's how I got that job and got in at Sybase. In my academic background, I think the most useful thing was mathematics, because even though I didn't major in it, I have kind of a sick interest in discrete mathematics: sets, formal logic, things like that. The other thing is Latin. In Latin you spend a lot of time reading people like Cicero, and it turns out that the principles of rhetoric you learn from somebody like Cicero are incredibly applicable in technical companies, because you're always explaining stuff. You're always trying to figure out, hey, who's the audience? What do they want to know? What do I want to tell them? How am I going to do it? Cicero is all about this; he spent his life doing it, and he's one of the best people who ever lived at it. So that goes back to reading Latin, but it was not so much the language as the content of what I was reading that could be directly applied to technical jobs.
Eric Dodds 7:13
Fascinating. Oh, that is so fun. Okay. Next question. And we will get to the ClickHouse stuff, but Kostas knows that I love to entertain myself by learning about people. You’ve done databases for a long time. A lot of times you hear about people like, “I started in databases, and then I kind of went and did this other thing.” But you’ve stuck with it. And you mentioned just a minute ago that it sort of combines all the things in CS that you like, but you’re so excited about databases, and you’re still doing it. I love that, but why are you still doing it?
Robert Hodges 7:50
Well, there are two things. One, as I said, there's a huge amount of computer science that just comes together in databases. It's kind of like operating systems, but I think even more diverse in some ways. For example, I really like distributed systems: the idea of being able to visualize things working concurrently on a network. What does it mean for things to fail? How can we develop algorithms so that we can still get work done? This is fundamental to modern databases, because most interesting data problems require more than one node; they're just big. So you constantly end up dealing with very fundamental results, like distributed consensus and the CAP theorem. These come up in real life in databases. I think the other reason databases are fascinating is that they have evolved enormously over the time I've been working with them, so there are always new things coming up. For example, when I worked at Sybase, that was what I call the relational database cathedral-building era. If you look at cathedrals in Europe, they went through these phases of building, Romanesque cathedrals and then Gothic cathedrals, with different components built out over time. In the late 80s and 90s it was relational databases, and they were driving things like commerce and transaction processing. That was really interesting. After a while it was, hey, we know everything about that, we understand ACID transaction models, but then new things came along: very large volumes of data. So what's next? Well, Hadoop, and different ways of thinking about processing data. Now we're in a completely different era where we have huge amounts of data and we want to answer questions about it really, really fast, so data warehouses that solve this problem are popping up all over the place. You just see this constant refreshing of the problems you're dealing with, new things to attack, new things to learn. That's why I've stuck with it. I don't feel any need to work on any other computing problem.
Eric Dodds 10:07
I love it. It's so fun. Okay, last question for me. We've talked about a lot of databases on the show. I don't know if ClickHouse has come up, but I know some of our listeners have used it, some of them have probably heard of it, and there's probably a subset for whom it's a new term. What is ClickHouse, and what makes it unique or different as a database? You've been doing this for a long time and it really attracted you, so tell us what drew you to it.
Robert Hodges 10:36
Yeah, so first of all, let me just tell you what it is, how it's the same as things that have come before, and how it's different. ClickHouse is a SQL data warehouse. That means it's a database designed to talk SQL, which is kind of the winning language for managing data. And as a data warehouse, it is designed to scan very large amounts of data and give you answers very quickly. This is a class of databases that started to develop in the early 80s with things like Teradata and what became Sybase IQ, and it has continued over time through the most modern incarnations, things like Snowflake, Redshift, and BigQuery, which have evolved the technology but are still solving the same fundamental problem. ClickHouse differs in a couple of really fundamental ways. One is that it is open source, unlike most data warehouse technology, which is proprietary; Snowflake is a great example. Being open source means it's accessible to anyone: any developer can grab it, stick it on their laptop, and develop an application. Moreover, you can use those applications in any way you want, so it gives you this freedom of use. The other thing ClickHouse does that's really interesting is it specializes in low-latency response. And not just low latency as in fast now and then, but guaranteed low latency. For example, you can build applications where at p99 you're going to get sub-one-second response. These two properties, the flexibility and the fact that it can give fixed response times on very large data, are a real game-changer and explain why the database is becoming so popular.
Kostas Pardalis 12:27
Quick question here from my side, Robert, because you mentioned data warehouses like Snowflake. Usually when we think about data warehouses, we think about queries that might take hours to answer a question, right? And that's common; especially for batch workloads, you don't have latency requirements on the order of a second or a millisecond, so it's fine to work that way. But from what I understand, that's not what ClickHouse does. It actually focuses on getting the lowest possible latency for the queries we ask.
My first question is a little bit more technical: how do we do that? What trade-offs do we have to make in order to focus more on latency? Because in engineering there are always trade-offs we have to make at some point. So what's the difference, technically, between ClickHouse and something like Snowflake in the end?
Robert Hodges 13:33
It's a great question, and I think you can answer it in two ways; both are relevant. One is the architecture, and the other is the features. Let's look at features first. ClickHouse was developed originally at Yandex to solve a specific problem. Yandex has a piece of software called Metrica, which is a lot like Google Analytics. People can come in, run queries on it through a nice interface, and see the traffic on their websites. And just like Google Analytics, they can choose different combinations of things: where are visitors coming from, how long do they stay on the site, and so on. What you need when you're solving that problem is for answers to come back very quickly. Moreover, you cannot pre-compute, or pre-aggregate as we would say, the data in all the possible ways people can ask for it. What you need is a piece of software that can take the raw data, just the messages arriving from logs, and answer these questions extremely fast. So from the very beginning ClickHouse was developed to focus on this problem: have a very, very large table, with potentially many, many columns of source data, and be able to answer numerical or aggregation questions straight off that raw data set in a very short period of time. As a result, the energy in ClickHouse went into that. Many databases have come out saying, hey, we're SQL compliant, every single SQL feature you can imagine, we've got it. That's the Postgres story. The ClickHouse story is no, we don't have every single SQL feature; we don't have ACID transactions, we don't have a DELETE command. What we do have is 40 different kinds of hash table implementations inside, each tuned to a specific use case where that hash table organization is going to give us the edge in providing rapid response. So that's a really fundamental difference from the traditional enterprise data warehouse. It's really product-led development, starting with this very basic use case, then extending out to other ones, and making feature trade-offs along the way, if that makes sense. The other part I mentioned is the architectural differences. Unlike Snowflake, ClickHouse still has a traditional, what we call "shared nothing," architecture, where it's basically a set of nodes with attached storage connected by a network. Now, we're working our way toward the model that Snowflake uses, but in the meantime, this is just extremely fast. When we run benchmarks against something like Snowflake, yes, it's a great architecture, and yes, you can store data really cheaply because it's backed by object storage, but ClickHouse answers these questions in a fraction of a second, and in many cases Snowflake takes tens of seconds or even minutes to answer the same question. So there's a real architectural difference, and again, it's focused on delivering speed and solving this specific set of problems around low-latency access to large quantities of data.
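To make that pattern concrete, here is a minimal sketch of the kind of schema and query Robert describes: a wide table of raw events and an aggregation answered straight off the raw rows, with no pre-aggregation. It assumes a local ClickHouse server and the clickhouse-connect Python client, and the table and column names are hypothetical.

```python
import clickhouse_connect

# Assumes ClickHouse is listening on localhost:8123 with default credentials.
client = clickhouse_connect.get_client(host='localhost')

# A wide table of raw events, ordered for fast scans by site and time.
client.command("""
    CREATE TABLE IF NOT EXISTS page_hits (
        event_time  DateTime,
        site_id     UInt32,
        url         String,
        referrer    String,
        duration_ms UInt32
    )
    ENGINE = MergeTree
    ORDER BY (site_id, event_time)
""")

# Aggregate straight off the raw data: traffic per site over the last day.
result = client.query("""
    SELECT site_id, count() AS hits, avg(duration_ms) AS avg_duration_ms
    FROM page_hits
    WHERE event_time >= now() - INTERVAL 1 DAY
    GROUP BY site_id
    ORDER BY hits DESC
""")
print(result.result_rows)
```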
Kostas Pardalis 17:05
That's some great information you shared with us right now. You mentioned the features, and you said, for example, that there are no deletes, right, or no ACID guarantees like we see in databases such as Postgres. What else do we have to trade off there? Things like joins, for example?
Robert Hodges 17:22
Yeah, joins are a great example. And don't get me wrong, there is a transaction model, it's just not ACID. The transaction model is: if I write a block of data, it always shows up; we never get torn blocks. And by block, I mean it could be a chunk of 100 million rows in a table. But yeah, joins, absolutely, there have been some real trade-offs. ClickHouse, by default, uses what's called a hash join. That's a join that works very well where you have one table that's very large. You're going to scan that table, and the data you join with, you preload into memory. Then you just look it up in memory and see if you've got a match, and if you do, you pull the join columns over. That is different from a database like Snowflake, which can do merge joins, for example, where instead of one very big table and maybe smaller tables joining with it, you can have multiple very large tables. But the trade-off is that merge joins are great, but they're not fast, because in order to process those joins, to process complex queries that arbitrarily join data from many locations, you have a long process where you do a join, then shuffle data around between machines, then do another join, and so on and so forth. So there are real trade-offs here that we have to deal with, and they were part of the design choices for ClickHouse.
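For illustration, here is a hedged sketch of the default hash-join pattern just described: scan the big table and let the smaller right-hand table be loaded into an in-memory hash table. It reuses the hypothetical page_hits table from above plus an equally hypothetical sites dimension table.

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host='localhost')

# The large left-hand table (page_hits) is scanned; the small right-hand
# table (sites) is hashed into memory, which is ClickHouse's default join.
result = client.query("""
    SELECT s.site_name, count() AS hits
    FROM page_hits AS h
    INNER JOIN sites AS s USING (site_id)
    GROUP BY s.site_name
    ORDER BY hits DESC
""")
print(result.result_rows)
```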
Kostas Pardalis 18:53
That's super interesting. Okay, so we talked a little bit about the more technical side of things and the trade-offs there. But usually we make trade-offs because we're trying to focus on different problems, right? So what are the use cases behind ClickHouse as a technology? Why were these trade-offs made, and how do the trade-offs we talked about address those use cases?
Robert Hodges 19:18
Yeah, so I think there is an increasingly large number of use cases. The first one was web analytics, and we still see many ClickHouse users pursuing that use case. I'll give you an example: Cloudflare. Cloudflare is a super successful company. They provide DDoS protection, they shield websites, they provide networking, DNS lookups, things like that. It turns out that if you're a tenant of Cloudflare and you go to the dashboard, chances are the data you're seeing is actually served up from a huge ClickHouse cluster that they maintain. I haven't talked to them recently, but last I looked it was on the order of 120 nodes. Everything that pops up on those dashboards is coming out of ClickHouse, which is rapidly assembling the data from sources like DNS lookups. But there are some other use cases that I think are more interesting because they're completely new. I'll give you a simple example. There's a company called MCSE, which is a video content delivery network. They are the folks, among other things, who deliver streaming video for things like the American Super Bowl. That's part of their business, but an interesting aspect of what they do is that they collect analytics on the streaming video in real-time. They have the content delivery network, which can provide telemetry, and they have applications running in users' browsers, which can also send up information about what they're seeing. They combine that in a ClickHouse database so that the people operating the Super Bowl live stream can look at the metrics in real-time and see, hey, are we having rebuffering problems? Are we seeing content bottlenecks? Are we seeing problems with specific browser types? They can recognize those problems, diagnose the root cause, fix them, and then go back to the metrics and confirm they've got a fix. And they can basically do this in the time it takes the NFL to take a timeout. This is a completely new business. It was not possible, say, 20 years ago with the technology that was available.
Kostas Pardalis 21:40
How is ClickHouse, or this class of databases, different from what we have so far called time-series databases, which focus mainly, in terms of use cases, on things like monitoring? Because the use case you described with the video players is close to, let's say, the problem of observability. So how are they different, or is there an overlap there?
Robert Hodges 22:09
There actually is overlap, and it's a great question, because ClickHouse really handles a superset of the use cases of databases like Timescale and InfluxDB. The way time-series databases work is they assume you have a series of measurements characterized by time, with an arbitrary set of attributes attached to them. So it's not the same model as a traditional SQL database, which has a table with columns, a rectangular format. What ClickHouse does is solve that same problem, just in a different way. First of all, ClickHouse has very efficient support for time. It has multiple data types for it, and it also has a wealth of functions that can do transformations. When you're doing time series, a very common thing you want to ask is, hey, what happened each hour, each day, each week, each month? ClickHouse has functions to normalize dates and do that kind of bucketing, and you do it straight off the raw data: while you're doing a scan, you can just bucket things and then group by them. The other thing is that ClickHouse is a column store. So even though you don't have quite the same flexibility of just randomly adding attributes, ClickHouse allows you to add as many columns as you want, and it provides stunningly efficient compression on them; the compression is really outstanding. We don't just use LZ4 and ZSTD. On top of that, we have what are called codecs, which are ways of transforming the data before it even gets to compression, to reduce the size and get it into something that will compress even more efficiently. So as a result, you can solve the same problems that time-series databases do, but you have a database that can also handle much more diverse use cases and doesn't force you to think of everything in terms of that time-series model.
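As a hedged sketch of those two points, per-column codecs layered under general compression and time bucketing straight off raw rows, here is what a hypothetical metrics table might look like; the names, codec choices, and the local ClickHouse server are all assumptions.

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host='localhost')

# Per-column codecs (Delta for timestamps, Gorilla for floats) are applied
# before ZSTD compression so the data compresses even better.
client.command("""
    CREATE TABLE IF NOT EXISTS cpu_metrics (
        ts    DateTime CODEC(Delta, ZSTD),
        host  LowCardinality(String),
        usage Float64  CODEC(Gorilla, ZSTD)
    )
    ENGINE = MergeTree
    ORDER BY (host, ts)
""")

# Bucket the raw measurements by hour during the scan, with no pre-aggregation.
result = client.query("""
    SELECT host, toStartOfHour(ts) AS hour, avg(usage) AS avg_usage
    FROM cpu_metrics
    GROUP BY host, hour
    ORDER BY host, hour
""")
print(result.result_rows)
```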
Kostas Pardalis 24:17
Well, it's very interesting. Do you see these database systems, like ClickHouse, eventually replacing the dedicated time-series databases?
Robert Hodges 24:25
Yes. Yes, I do. This is just me, but I believe that the low-latency data warehouses, products like ClickHouse, Pinot, and Druid, which are all kind of in the same space, can handle the use cases I mentioned. You can do them in Druid, and whether Druid is better for it than ClickHouse, well, try your application against it and you'll probably be able to tell. But yeah, I believe they're going to take over this space. There's some historical basis for that assumption, which is the fact that, over time, SQL on top of a relational model has pretty much subsumed most of the use cases people have for data management. I think this is another case of that. The time-series databases are interesting, but over time I think databases like ClickHouse, which have very good vectorization and take advantage of all the data warehouse technology that has developed over the past four decades, will ultimately be able to solve this problem far better than the narrowly focused databases.
Kostas Pardalis 25:43
That's super interesting. You mentioned two other technologies that are similar to ClickHouse: Pinot and Druid. So what's similar between the three of them, and what are the main differences?
Robert Hodges 26:04
Sure, and I'm going to excuse myself on Pinot because I haven't used it and haven't really looked into it too deeply. The comparison I think I can do justice is Druid versus ClickHouse. They are the same in one fundamental way, which is that both Druid and ClickHouse are designed to solve this problem of providing low-latency response no matter how big the dataset gets. That's a very important similarity: they frame the problem the same way. They also share the idea that a lot of the data is going to be coming off event streams, which are arriving very quickly, often millions of events per second. So they both support columnar storage, they have efficient scanning techniques, and they're able to parallelize not just across CPUs but also across many nodes on a network to deliver these responses. Where they differ is that Druid, when it was originally developed, didn't even support SQL, and it did not support joins for a long time, although those have since been added. The other thing is that Druid has a more complex operational model. In ClickHouse you really just have one process, the ClickHouse database engine; it's a single process that looks almost like MySQL, you just pop it up and it runs. We also use ZooKeeper to keep the cluster coordinated, but that's it. With Druid you have something like six different process types that serve different purposes, so it's operationally more complex. In that sense, I think ClickHouse is a better architecture. On the other hand, one of the things that is good about Druid is that it was built from the start, or from very early on, to use object storage as a backing store, and that is a very good feature of Druid. So there are differences in how they've gone after these problems. But what's interesting, and where I give the Druid folks huge credit, is that they framed the problem the same way. In the United States, at least, they were one of the first systems that really recognized this problem and recognized that new technology was necessary.
Kostas Pardalis 28:26
Yeah, that's what I wanted to ask you next, because if I'm not wrong, Druid is probably the first technology that tried to address these problems and these use cases. It's been around for quite a while.
Do you feel that the fact that we hear so much more, and see more products, around these problems today than when Druid started is just about market conditions and timing? Or is it also because of the operational choices that were made and how hard it was, in the end, to operate?
Robert Hodges 29:04
I think what's happening is that people are recognizing the business opportunities around this. There's web analytics, there's content delivery network management, there's observability. There's log analysis, and by that I mean service logs; it's an old problem, but this is a new way to solve it. There's real-time marketing. In fact, let me give you one more example of a use case that comes from our customer base. You go to a website with an ad blocker on, and after a few pages something pops up and says, hey, wouldn't you like to turn that off and sign up with us, have a subscription? We can see you've visited the website X number of times in the last hour. Well, that's actually backed by a data warehouse; in that particular case it was backed by ClickHouse. The idea is that this information is being fed in in real-time, and the data warehouse is able to answer how many times you've been on the page within about 10 milliseconds, sufficiently fast that you can apply that knowledge in the time it takes to render a page. This creates a whole new industry, a whole new extension of what people can do with real-time marketing and interacting with customers. That's why there are products attacking this problem: people are starting to understand that, one, there's a business opportunity, and two, there are technologies like Druid, like ClickHouse, like Firebolt, that begin to solve it. So you're starting to see people coming into the market and offering solutions for users.
Kostas Pardalis 30:47
It’s nice that you mentioned Firebolt. I think Firebolt is also based on ClickHouse, right?
Robert Hodges 30:52
It is. I wrote a blog article about it. Just to be clear about our business: our focus is on real-time applications, with ClickHouse as the linchpin, but we're always looking around, because we're helping people make decisions early on about which technology is right and then how to build the application. So we're super interested in emerging technologies like Firebolt. And yes, Firebolt announced it in a talk they gave in the Carnegie Mellon database series, which is out on YouTube, around December 15. They said, hey, we've got a new query engine, and it's ClickHouse. I sort of knew this already because I know one of their query engineers, Ben Wagner. I met him at re:Invent and he said, hey Robert, there's something you'd be interested in hearing, but I can't tell you what it is, wait for this talk. Sure enough, he gave this great talk, and I ended up writing a blog article analyzing what they did: what it means for ClickHouse, how we can respond, and also what it means for analytics in general.
Kostas Pardalis 31:59
That’s super cool. One more question from my side and then I want to give some time to Eric.
Eric Dodds 32:05
There’s never one more. There’s never one more.
Kostas Pardalis 32:11
I’ll keep it to one more because you said the magic word, which is marketing analytics in real-time. So I’m pretty sure Eric—
Eric Dodds 32:19
Oh, man, that wasn’t even what my question was about.
Kostas Pardalis 32:22
Oh, okay. But I have one more question I want to ask before I give the microphone to Eric. There is another set of real-time technologies, the distributed brokers like Kafka, right? They've built a whole business around real-time data. And beyond the broker itself, which has a very specific place in a company's stack, they have also built ways to query the data there; they've even created some kind of database on top of it and all that stuff. How do technologies like ClickHouse or Pinot or Druid compare with, work together with, or compete with something like Kafka? How do they work together?
Robert Hodges 33:17
These are completely complementary technologies. I know Kafka talks about KSQL and the ability to do queries on event streams, and that's significant in some use cases. But here's what we see much more often. We're getting up toward a couple hundred customers, and I would say half of them are using either Kafka or newer alternatives. For example, there's a very interesting technology called Redpanda, which is a Kafka-compatible rewrite that uses vectorized processing to make it extremely fast. In these applications we're discussing, like real-time marketing, you've got data coming from multiple upstream services, and what you need is a big pipe you can just toss it into without worrying about what's going to happen to it at the other end. That is Kafka. Kafka solved this problem brilliantly by creating this distributed log. And we're at the other end, because in order to actually run meaningful queries on this data, you have to have it all in one place. Let me give you a concrete example of why you want the data warehouse at the other end. Let's look at security. We have a bunch of customers who, one way or another, are dealing with threat analysis and notification about security problems. What happens in a typical security incident is you notice that some machine is beginning to make DNS requests for a server that is infected or a source of malware. So an alert pops up: it comes in, gets stuck into Kafka or Redpanda, shows up in the data warehouse, and somebody alerts on it and says, there's something you've got to look at. Well, the fact that one server is making this call is just an indication that you've got something to look at. In order to figure out what's really going on, you need not only that information; you now need to be able to look at the history of DNS calls for that particular machine, for that particular data center, for that particular type of application, however you choose to divide it up. Moreover, you want to see the history going back, often days, weeks, or months. So the data warehouse, by having the data in there, allows you to get that initial notification, then do the analysis necessary to figure out what's really happening, and then do something constructive about it. These two things work together perfectly. In fact, a lot of our work is focused on tying them together: ClickHouse is the linchpin, but there's the event stream and then the platform you run on. For us, that means an enormous amount of work on Kubernetes; that's one of the places where our company has done a lot of innovation. For example, we're the ones who made ClickHouse run on Kubernetes; we basically built an operator for ClickHouse. We're focused on helping people tie these things together and make them work efficiently to build these applications.
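One common way to wire the two together is ClickHouse's Kafka table engine plus a materialized view, which continuously drains a topic into a MergeTree table. The sketch below assumes a local ClickHouse, a reachable Kafka broker, and hypothetical topic, table, and column names.

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host='localhost')

# 1. A table that reads raw JSON events off a Kafka topic.
client.command("""
    CREATE TABLE IF NOT EXISTS dns_events_queue (
        event_time DateTime,
        host       String,
        query_name String
    )
    ENGINE = Kafka
    SETTINGS kafka_broker_list = 'kafka:9092',
             kafka_topic_list  = 'dns_events',
             kafka_group_name  = 'clickhouse_dns',
             kafka_format      = 'JSONEachRow'
""")

# 2. A MergeTree table that keeps the history for later analysis.
client.command("""
    CREATE TABLE IF NOT EXISTS dns_events (
        event_time DateTime,
        host       String,
        query_name String
    )
    ENGINE = MergeTree
    ORDER BY (host, event_time)
""")

# 3. A materialized view that continuously moves rows from the queue into storage.
client.command("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS dns_events_mv
    TO dns_events AS
    SELECT event_time, host, query_name FROM dns_events_queue
""")
```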
Kostas Pardalis 36:32
That’s great. I could keep asking questions, but I—
Eric Dodds 36:35
Go for it.
Robert Hodges 36:37
Go for it, yeah.
Eric Dodds 36:38
Take another one, Kostas.
Kostas Pardalis 36:39
No, no, no, no, no, no.
Robert Hodges 36:40
I can tell you, I don't want to bore our listeners, but look, I never get bored with this stuff. What's so cool about it, and I think the best part of working in this job, is that these are the new cathedrals. I talked about cathedral building; this generation of cathedrals is solving problems like dealing with threats coming in from malware and things like that. It's just seeing the creativity people exhibit in building these applications. Some of them, like the real-time marketing one where they pop the thing up on the webpage, you have to see to believe they're possible. It's almost mind-expanding to see the way people use this technology.
Eric Dodds 37:28
Okay, so I'm going to jump in, Kostas. Let's step back. I'll ask the question first and then give a little bit of context: when do you implement ClickHouse? Here's the context. Let's go through the lifecycle of a company that's building an app, whether that's consumer, B2B, or whatever. They're scaffolding out the software, maybe they have a Postgres database as the back end of the app, they grow, they add a data warehouse that's doing analytics, maybe they have a data lake. In that lifecycle, which is pretty well established at this point as you build out your stack, when do you adopt ClickHouse? Maybe from your perspective that's immediately, or maybe there are use cases ClickHouse is better suited for, but I'd love for you to give our audience a sense of the arc of building a stack, from two guys in a garage who are just querying the production database to see what's going on, all the way to an enterprise that needs real-time data happening now, like the real-time marketing use cases. When should ClickHouse enter the picture in that arc of maturity?
Robert Hodges 38:51
Sure, I think that's a great question, and there are really a couple of answers; we see two patterns. One is a pattern that was previously very common: people outgrow their existing databases. Postgres and MySQL are super popular, and right now, if you're building an application and need a database, for many developers the go-to database is becoming Postgres. So you're collecting data, you're processing transactions, maybe you're recording DNS requests, and you stick them in Postgres. After a while you're successful, so you start to shard it across more and more Postgres instances using something like Citus. Then you realize, hey, this needs a bit more horsepower, so you add aggregations in there, again in Postgres. After a while you figure, hey, we're just outgrowing this technology, and that's when you roll in ClickHouse, because of the trade-offs we talked about: there are cases where ClickHouse can answer a question in a second that would take Postgres literally hours. That pattern is very common. In fact, one of the early users was Cloudflare; that's exactly how they grew into ClickHouse. They started with Postgres, and many others have done the same, sort of growing into ClickHouse over time. Now that people can see these patterns and recognize when they're going into a domain where, say, I'm doing real-time marketing and I know the datasets are going to be very large, they can adopt ClickHouse right from the start. And the cool thing about ClickHouse and the other open-source databases is that because they're open source, you can just go get a community build, pop it up on your laptop, and develop your app. Then the problem you have is, okay, I've developed the app, now I just need to get it to run in a production environment. So we see customers and users coming to us from both of those paths: either replacing an existing database they've grown out of, or they've built the app, they know ClickHouse is what they need, and the problem is more, okay, how do I get this thing to run in an environment where I don't have to worry about scaling it, for example because they run on our cloud and we handle that for them.
Eric Dodds 41:17
Okay, next question on that. Take real-time marketing: what is the distance between the data that's needed and the capability to deliver it in real-time? The distance, let's say, between ClickHouse and what the end user is experiencing. I'm thinking from the data engineer's standpoint here. If I have the data in ClickHouse and I have the horsepower to do this stuff in real-time, I still have to actually get the data into a user experience, into the front end of some application, in real-time. So I'd love to know what you see from ClickHouse users. What's the distance between ClickHouse and that actual experience? Because that's the other piece of the puzzle in terms of the data engineering, or really even the software development piece: we have pipelines, we can do this in real-time, but there's still an app experience, a browser experience, all that sort of stuff.
Robert Hodges 42:24
The pipelines we've been talking about, the event streams, that technology is now really robust and I think very well understood. The apps are the interesting part. There are a couple of traditional ways people bring that data out and make it available. One is to use dashboards, with a tool like Superset, for example, which is open source, or Grafana is another one. These let you build dashboards where people can quickly see what's going on and react to things very quickly. The problem is that in both cases they're kind of fixed, right? If a user wants to go in, tweak graphs, and quickly change things about the data, that's harder to do, because those products don't support it out of the box. The other thing people do is build the front ends themselves, JavaScript or TypeScript, React applications, and then go straight back to ClickHouse to get questions answered. What's happening now, and I think the really interesting development we're tracking, is the middle tier you have to build. People were just doing this by hand: they'd have React, then they'd have Node in the middle, and they just hand-coded it. But we're starting to see products emerge in that space, like Hasura. That's a product that will stick a GraphQL API on top of an existing database, and they're adapting it for ClickHouse. We also see things like Cube.js, which puts on what looks like a SQL API and builds what we used to call cubes back in the old data warehouse days; that's another middleware tier, and applications just connect to that. So this middle tier that serves up the data and gives you an indirection layer in front of the database, that's an emerging topic. And then you have the dashboards and your typical front-end stacks talking to the data.
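For a sense of what the hand-built version of that middle tier looks like, here is a hedged sketch: a tiny HTTP service that sits between a front end and ClickHouse and returns JSON. Flask and clickhouse-connect are assumptions, as are the endpoint, table, and column names; a real service would add proper parameter binding, authentication, and caching.

```python
from flask import Flask, jsonify, request
import clickhouse_connect

app = Flask(__name__)
client = clickhouse_connect.get_client(host='localhost')  # assumes a local ClickHouse

@app.route('/api/traffic')
def traffic():
    # Cast to int as a crude guard; real code should use proper parameter binding.
    site_id = int(request.args.get('site_id', '0'))
    result = client.query(f"""
        SELECT toStartOfHour(event_time) AS hour, count() AS hits
        FROM page_hits
        WHERE site_id = {site_id}
        GROUP BY hour
        ORDER BY hour
    """)
    return jsonify([{'hour': str(hour), 'hits': hits} for hour, hits in result.result_rows])

if __name__ == '__main__':
    app.run(port=8080)
```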
Eric Dodds 44:32
Totally. Super interesting. We've seen some really interesting architectures around something like Hasura, which basically exposes that API to enable some really interesting things. Those architectures are fascinating. And I think you're right; circling back to what you said earlier, it's really fun when all this stuff comes together, and a lot of it really is databases and variations on them, right?
Robert Hodges 44:59
Yeah. We haven't seen Hasura on ClickHouse in production ourselves yet, but we do see other people using GraphQL, and we get a lot of questions like, hey, how can I just pop a GraphQL interface on top? I think the other thing to understand is that the challenge isn't just technical. You have dev teams, and not everybody can do everything at once. You have teams that understand the business problem they're solving, but there's this relatively complex infrastructure they have to design and then operate to make these applications work. So I think one of the challenges dev teams have is not so much figuring out the design, but how to get the business part done, the end-user experience that's going to make the product successful, without becoming an expert on, say, compression in ClickHouse, or deciding whether tweaking this compression setting buys them 5%, or figuring out how to fix a bug in ClickHouse. A lot of what our work is about is providing that expertise so the developers can just be developers. They figure out how to make it work, but they don't have to worry about how to operate it or immediately become deep experts in ClickHouse and the technologies it connects to.
Eric Dodds 46:25
Okay, a non-ClickHouse question for you here, and Kostas, I would love for you to jump in. This has been super fascinating and very helpful for our listeners, but you are an expert in all things database. Outside of ClickHouse, what are some of the other interesting database technologies you've seen come out in the last couple of years? Of course ClickHouse is exciting, but is there anything you've seen that piques your interest in terms of breaking new ground, or open-source projects?
Robert Hodges 47:00
That's a really great question. I have to say the analytic databases are the main thing I'm focused on right now. I will say I think Firebolt is doing something pretty interesting. They were very quiet about what they were doing for a long time, but what they're doing is kind of throwing down the gauntlet, saying, hey, you can have the capabilities of Snowflake, which include a broad SQL implementation and the ability to store data in object storage, but now you're also going to get this real-time behavior, and do it in a cloud database. I think that's the next step in cloud databases. If you look at what happened over the last decade, two really significant databases came up. Redshift was the first to bring data warehouses to the cloud, a brilliant idea: you went from potentially spending two months to get something installed to about 20 minutes with a credit card. Then Snowflake brought the separation of compute and storage, and BigQuery has done the same thing. These are really significant, but they came at a cost, because object storage is really slow, and low latency is not a focus of either of those products. So Firebolt is doing something really interesting there, and as I say, laying down the gauntlet for other people to try to match their capabilities.
Kostas Pardalis 48:34
Eric, do you expect me to share my opinion on that?
Eric Dodds 48:38
Absolutely. Actually, I am interested in this because, of course, you moved to Starburst, which was both upsetting and exciting for me. But what are you seeing? Because you do federated querying, and you're seeing people query all sorts of stuff. What are you seeing out there, Kostas?
Kostas Pardalis 48:59
I don't think my answer will come from my experience at Starburst, because Trino and Presto and all these technologies have been around for a while; it's not something new. They were mainly used by large enterprise companies, and I think what we will see, especially now with data lakes and the lakehouse pattern, is that they will also be used by, let's say, smaller companies, so they will become much more approachable to people out there. Exactly what the use cases are and how they do or don't compete with solutions like BigQuery or Snowflake is a big conversation for another episode. But there is some stuff I see happening in the industry, and in the technology around databases, that I find very interesting, and it's not necessarily product-level technology right now. One of the things I really enjoy is playing around with a project called DuckDB, which is a very interesting approach to building a columnar database for use cases that until recently you would usually need something like a Snowflake to do. The user experience is brilliant, actually; it's a very interesting and very refreshing way of doing things. Another thing I love about the team behind DuckDB is that they experiment a lot. They compiled it to WebAssembly, which means the whole thing can run inside your browser, which is also a very interesting experiment. So that's one path. The other thing is anything to do with Arrow. I love the Arrow project; I think those people are doing an amazing job trying to build, let's say, a framework for representing data in memory in a vendor-agnostic way. I'm paying a lot of attention to that project to see exactly how it's going to impact the database industry.
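To show why the DuckDB experience feels so lightweight, here is a hedged sketch of the workflow Kostas describes: an in-process engine querying a Parquet file directly, no server required. The file and column names are hypothetical, and the duckdb Python package is assumed to be installed.

```python
import duckdb

# An in-memory DuckDB database; nothing to run beyond the Python package.
con = duckdb.connect()

# DuckDB can scan Parquet files in place, straight from the file path.
rows = con.execute("""
    SELECT user_id, count(*) AS events
    FROM 'events.parquet'
    GROUP BY user_id
    ORDER BY events DESC
    LIMIT 10
""").fetchall()
print(rows)
```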
Robert Hodges 51:25
And Eric, if I could also add to this: one of the things I'm really looking at is actually Kubernetes. Kubernetes has been around for a long time, but it's only in the last two to three years that it has really come into its own as a platform for data. There have been a couple of key technology developments, but I'll name just one: the notion of an operator, which gives you the ability to define resources of your own, things like distributed databases, ClickHouse, Elasticsearch, whatever database you want to support. Right now everybody is very focused on operating in the cloud, but as Kostas said, there are technologies out there, like DuckDB, that run just great anywhere. I think that as people begin to have more and more of these large workloads, it's going to be obvious that, hey, there are times when yes, it's great to run in the cloud, but for a variety of reasons you may want to run in your own data centers, or in clouds you control directly, where you do not have an outside vendor controlling things. Kubernetes is the answer to that. It is the portability layer that lets you run not just in the Amazon cloud using EKS; if you decide you'd like to run this on Hetzner, or on Equinix Metal, Kubernetes is the tool that allows you to do that. Right now we have a very centralized computing model in the cloud and everybody is very focused on that, but these pendulums always swing back. I see this as a really interesting development, and it's something we're very focused on in our company: using Kubernetes to enable people to run these data warehouses and build these applications wherever they want. We already see our customers doing this, using Kubernetes in combination with managed services, for example, to build very flexible, very portable applications. So I think there are going to be some really interesting things in that area as well, along with the new databases we've talked about.
Kostas Pardalis 53:42
That's a very interesting point you brought up there, Robert. I think Google was trying to address this with Anthos, but I don't think that project went very well; I don't think they succeeded in delivering, let's say, some kind of technology that can run in this hybrid model, where you can deploy it on a cloud or on your bare metal or wherever you're running. But maybe that's an execution thing and not something that has to do with the market itself. It's going to be very interesting to see how this progresses.
Robert Hodges 54:21
I think one of the most interesting developments in Kubernetes is not so much a technological one as the fact that there's now managed Kubernetes in so many different environments. The big challenge with Kubernetes, and I think what held back adoption, is not that it's hard to run applications on Kubernetes; any competent developer can do that. It's running Kubernetes itself. But as soon as you have managed distributions, and every cloud provider, large or small, has this now, that problem goes away, which makes Kubernetes a practical answer. Plus there's just better technology for running it even on-prem at this point, more people know how to do it, and companies like SUSE and Red Hat with OpenShift are supplying technology for this. I think we're going to see some really interesting stuff around Kubernetes over the next few years.
Kostas Pardalis 55:14
I totally agree on that.
Eric Dodds 55:16
I would describe that, Robert, as option value, and I think that's becoming really important. It's fun for us to talk about this; all three of us are working to build technology in the data space, and that's great, and we love talking about it, right? But the people we serve are trying to build these data stacks, and that's pretty hard. It's actually a lot harder than we probably say in our marketing materials, because they face this world of choices and confusing marketing. But one thing I do see, and I think you make a great point, is the option value in building on something like Kubernetes, where the appetite for getting locked in is decreasing at an unbelievable rate. People want tools that give them option value: open source is a component of that, multi-cloud is a component of that. And it's a big deal. I think the companies that don't adopt an architecture that allows for that option value and flexibility are going to struggle. It's a big market, it's changing rapidly, and in some ways it's a decade-long change that we're at the beginning of, but I think that's a great point.
Robert Hodges 56:28
Yeah, and I think it comes back to what we see: there are all these different choices that go into where you want to run a particular application. Where are your data sources? Where's the data coming from? What compliance requirements do you have? What's your cost model, the economic model behind the application? What I'm always amazed at is just how varied the information is that goes into those choices, and how different the choices are that companies both large and small tend to make. The VCs like to call it optionality. I think we're definitely going to see this become a big issue. In fact, the obvious case is that people get systems deployed and then realize, wow, this is actually kind of pricey up there in Snowflake, and I'm locked in. That's the roach motel: I'm not coming back out.
Eric Dodds 57:29
At the end of the day, it is a really fun space to be in, building some cool stuff. And I have had the privilege of letting the show run long because Brooks isn't here, so we've gotten to go long, Robert, which is great, because Brooks isn't waving the flag saying, “Land the plane! Land the plane!” I love when I get to do that because these conversations are so wonderful. Thank you again for giving us some of your time, and thanks for telling us about ClickHouse and talking about real-time. This has been really great.
Robert Hodges 57:59
Yeah, and Kostas, thanks for the reminder. I’ve absolutely got to go back and check that out. Yeah, this is really fun. Thank you both.
Kostas Pardalis 58:10
They have done some really interesting experiments; I think it's in that same spirit of having fun. I don't know if they have any plans to productize what they're doing, but you see some updates from them and you're like, oh, wow, why did they do that? For me, at least, it has become a very useful tool sometimes, especially when I have data in Parquet files and I just want to query them; it's the easiest way to do it. Like, super, super easy.
Robert Hodges 58:46
Yeah, that’s why the fun never stops.
Eric Dodds 58:52
Maybe we’ll put that in the title of the episode.
Robert Hodges 58:54
Yeah, the fun never stops. Yeah, feel free to use it. I’m releasing it into open source. It’s a permissive license.
Eric Dodds 59:05
I have control over the show because Brooks isn’t here, so consider it done. Databases: the fun never stops. Robert, thank you.
Robert Hodges 59:11
Thanks, guys.
Eric Dodds 59:12
Yeah, and we’ll talk again soon.
Robert Hodges 1:00:08
All right, bye. Bye.
Eric Dodds 1:00:08
That was a great show. I have two takeaways, which I know isn't allowed, but I guess I make the rules so I can kind of break them, and Brooks isn't here so we can do what we want.
I'm fascinated. I want to start a separate podcast, not that I would ever do this, about how people's academic backgrounds, even ones really different from what they do today, influence their work today. That's just so interesting to me. I thought it was fascinating hearing how his study of languages was very practical, like, okay, I got a job where I was translating Japanese, but then also how the rhetoric from studying Latin and reading Cicero influenced the way he thought about some of the theoretical pieces of databases and their use in real implementations. I love that stuff. Maybe I'm just a big nerd, but I love learning about people's backgrounds, and that was my big takeaway. Just love hearing about that. How about you?
Kostas Pardalis 1:00:11
Yeah, okay, I have to say, this is also cultural; I think it has to do with American culture. That's a big and very interesting conversation we could have. But I have to agree, it's very fascinating, and it isn't the first time we hear from someone who had such an interesting journey into technology, right? Outside of that, I really enjoyed the conversation today. I'm not going to say anything about the technologies we talked about, or the use cases and the markets around them. What I really, really enjoyed was spending 30 or 60 minutes with someone who is so excited about this stuff.
Eric Dodds 1:00:55
Totally agree.
Kostas Pardalis 1:00:56
And that is probably, like, the best thing that can happen to me as a host of this show. It's so refreshing.
Eric Dodds 1:01:04
So fun. All right, well, thank you for joining us. A lot of great shows coming up with guests just like Robert who are very excited about what they’re working on. Eric and Kostas here signing off and we’ll catch you on the next one.
We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.
Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
To keep up to date with our future episodes, subscribe to our podcast on Apple, Spotify, Google, or the player of your choice.
Get a monthly newsletter from The Data Stack Show team with a TL;DR of the previous month's shows, a sneak peek at upcoming episodes, and curated links from Eric, John, & show guests. Follow on our Substack below.