Data Council Week (Ep 1): Discussing Firebolt’s Engine With Benjamin Hopp

April 25, 2022

Welcome to a special series of The Data Stack Show from Data Council Austin. This episode, Eric and Kostas chat with Benjamin Hopp, Lead Solution Architect at Firebolt. During the episode, Ben discusses all things Firebolt, ClickHouse, and what’s to love and dislike about working in the data space.

Notes:

Highlights from this week’s conversation include:

  • Ben’s career journey (2:55)
  • What makes Firebolt different (3:58)
  • Firebolt’s data product family (7:37)
  • Table engines and Firebolt (10:57)
  • Ben’s favorite part of ClickHouse (12:52)
  • The experience of building an optimizer (15:19)
  • Where Firebolt fits into architecture (17:27)
  • Working in the data space: to love and dislike (19:51)
  • What’s coming in the near future (24:35)

 

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 0:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.

Welcome to The Data Stack Show. We are recording on-site at Data Council Austin, which is super exciting. Let’s talk about our first guest. We’re in a little conference room here with all the mics set up, which is super fun, and we’re going to talk with Ben from Firebolt. Now, Firebolt, Kostas, is— I don’t know if it’s come up a ton on the podcast, but you and I have talked about Firebolt, and it’s a really interesting product, and all their marketing, and even the name of the company, is really focused on speed, of course. And so what I’m interested to know is, when you talk about blazingly fast analytics, for example, that can mean tons of different things. And it’s probably impossible for a new-ish tool to be the fastest across the entire value chain as it relates to data. So I want to know: where, on what specific thing, are they super-duper fast? What are the couple of things that they’ve staked their claim on? So that’s what I want to learn, and what does that look like under the hood? How about you?

Kostas Pardalis 1:36
Yeah, I mean, for me, it’s always interesting to see companies going to market with data warehouse solutions. Database systems are notoriously difficult to build, meaning many teams have tried and failed, and they usually take many years to get to market. So it’s very interesting to learn more about the whole journey of how they started and how they ended up where they are right now, pretty much competing with the other cloud data warehouse solutions out there, like Snowflake and BigQuery. So I’m very curious about this journey, where the product is today, what is missing, and what’s next.

Eric Dodds 2:19
All right. Well, let’s dig in and talk to Ben.

Kostas Pardalis 2:21
Let’s do it.

Eric Dodds 2:23
Ben, welcome to The Data Stack Show. We’re so excited to have you. You are a solution architect lead at Firebolt and we’re super excited. We’ve wanted to have you on the show for a while, and we caught up with you at Data Council Austin. So we’re in person in Austin, which is a really fun way to do a show, so thanks for giving us some time out of your conference to spend on the show.

Benjamin Hopp 2:43
Thank you for having me. Really excited to be here and talk about data.

Eric Dodds 2:46
Cool! Okay, so give us your background. How did you get into data originally? And then how did you end up at Firebolt?

Benjamin Hopp 2:55
I’ve been in the data space my entire career. I kind of got pulled into it. Starting out of college, I thought I was going to be a Java developer, but I worked at a company that decided I was better suited as a database administrator. So I was a Microsoft SQL Server database administrator for a few years. From there, I went to work at Hortonworks, back in the days prior to the Cloudera merger. I did consulting for Hortonworks for a number of years, specialized in streaming data with Apache NiFi. That kind of brought me into the streaming world. And then I went to work for a company called Imply, with Apache Druid, some streaming data, big data projects, and worked briefly at a company called Upsolver doing streaming ETL. And that brings me to Firebolt, where I’ve been a solution architect for a little over a year now.

Eric Dodds 3:47
That’s super interesting. Yeah. So you’ve worked at a lot of companies that sort of built on core technology, a lot of it open-source, which I definitely want to hear about. Tell us about Firebolt, though.

Benjamin Hopp 3:58
Firebolt is a cloud data warehouse in the vein of something like Snowflake, but our claim to fame is really fast analytical queries. So we are targeting use cases that need sub-second performance, that are powering dashboards, powering visualizations, that really benefit from low-latency queries and high-concurrency workloads.

Eric Dodds 4:19
Super interesting. Give us a couple of examples there, because low latency, or even real-time, those terms are really relative. Some companies are like, “Data every hour is real-time.”

Benjamin Hopp 4:31
When we say low latency, I’m talking specifically about the query latency, not necessarily the data ingest latency. So, on queries: when you load a page, it may send out 10-15 queries, and you want all of those queries back sub-second. So that’s the query latency. As far as the data load latency, we’re not a real-time data warehouse; you’ve got to do batch loads. So five to 15 minutes is usually the highest frequency that you’re going to see. Obviously, we want to move towards real-time; we’re building out Kafka integration and things like that, and that’s gonna be coming soon. But right now it’s micro-batch ingestion.
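
For context, Firebolt’s batch ingestion typically flows through external tables that point at files in S3. A minimal sketch of what a recurring micro-batch load might look like, with hypothetical table and column names and approximate syntax:

```sql
-- Hypothetical external table over raw Parquet files in S3.
CREATE EXTERNAL TABLE ex_events (
    event_time TIMESTAMP,
    user_id    BIGINT,
    event_type TEXT
)
URL = 's3://my-bucket/events/'
OBJECT_PATTERN = '*.parquet'
TYPE = (PARQUET);

-- A recurring micro-batch load: append the newest files into the warehouse table.
INSERT INTO events
SELECT event_time, user_id, event_type
FROM ex_events;
```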

Eric Dodds

Okay, so on the query latency, what are some of the use cases that require that sub-second query latency?

Benjamin Hopp

Yeah. So we often see companies that have user-facing analytics, where their business depends on their users being able to log in and actually see their analytics. We also see a lot of internal use cases like dashboarding, Looker, Tableau, those sorts of things, where you want to be able to slice and dice your data and explore the data without waiting 15, 20, 30 seconds every time you issue a new query.

Eric Dodds 5:42
Yep. Makes total sense. Okay, Kostas, I’ve been monopolizing the conversation, as I often do.

Kostas Pardalis 5:49
No, no. That’s an amazing introduction. So let’s talk a little bit about what you did before you got to Firebolt, because you mentioned that before Firebolt you were working with Druid, so there were a lot of real-time kinds of use cases that you were working on. So how is Firebolt different?

Benjamin Hopp 6:13
Yeah, great question. So the biggest difference is the separation of storage and compute. Druid requires the data to be loaded onto the actual processing servers prior to being able to run a query, whereas with Firebolt, we aggressively cache data, but your first query can actually go fetch the data from your deep storage in S3. So you don’t have to wait for a cluster to start up and fetch all the data before you can query. And it allows you to spin up multiple of what we call engines, but it’s really just clusters of compute resources, independently of your data. So if you’re just doing a small amount of compute, but you may have lots of historic data, you can have all of that data stored in S3 and a fairly small amount of compute that’s actually being utilized, because you’re only querying a small sliver of the data at any given time. Whereas with Druid, if you want to have all of the data available for querying, you need to have all of the data loaded onto servers. So Firebolt is more efficient in that sense. Druid does have some advantages, no doubt, especially as it pertains to streaming data: being able to query your batch data and streaming data simultaneously is really useful for those use cases that really require that sub-minute ingestion latency, and direct integration with Kafka is a real nice feature.
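
As a sketch of the decoupling Ben describes, engines are provisioned and sized independently of the data sitting in S3; something along these lines, where the engine names and spec values are illustrative rather than exact syntax:

```sql
-- Two independently sized engines over the same data in S3
-- (names and spec values are illustrative).
CREATE ENGINE ingest_engine WITH SPEC = 'B2' SCALE = 2;
CREATE ENGINE dashboard_engine WITH SPEC = 'B1' SCALE = 1;
```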

Kostas Pardalis 7:36
That’s interesting, and here’s a follow-up on technologies. There’s Druid, there’s Pinot, and there’s ClickHouse. They belong to the same category of solutions; they were built for similar use cases. Would you say Firebolt is part of this family of products?

Benjamin Hopp 7:59
Very intimately part of that, yeah. Under the hood, Firebolt actually is using some ClickHouse code for the compute engine; it was forked from ClickHouse. But we use a completely different storage handler, and that’s what allows us to separate storage and compute, because otherwise ClickHouse requires the storage to be local. We also use a completely different query parser, so our query optimizations are all built in-house. And then there are some other tweaks and things like that, but the actual engine, the compute, the bits behind the scenes, that’s all based on ClickHouse code originally.

Kostas Pardalis 8:33
That’s super interesting. Why ClickHouse, and not one of the other two?

Eric Dodds 8:41
That’s a loaded question.

Kostas Pardalis 8:45
About Druid, too.

Benjamin Hopp 8:47
Well, at least as the goal was told to me (I started after the company was founded, so I can’t be sure of these things), but from the stories I’ve been told, the goal has always been to make a true, full-featured cloud data warehouse. That means being able to handle all data warehousing use cases, and ClickHouse was in the best position to do that. Whether you’re using Pinot or Druid, they don’t have very good join support. And I guess both— well, I’m not sure about Pinot; Druid is Java-based, I think. Yeah, I think it is, so there’s some overhead there. So being a C++ native application kind of gave ClickHouse an edge, plus the flexibility to extend it and build it into a full-featured cloud data warehouse rather than a specialty streaming solution.

Kostas Pardalis 9:43
How is Firebolt positioned and marketed when you have a couple of different companies out there that offer managed ClickHouse?

Benjamin Hopp 9:51
I don’t think that we have a negative relationship. I think that those managed ClickHouse services serve people that are already very familiar with ClickHouse very well. But ClickHouse is not a simple product to get in and use. Firebolt is built for simplicity. We aren’t just a wrapper for ClickHouse; we have fewer features than ClickHouse, because we want to make it user-friendly and stable and all of that. So we’re not just a ClickHouse fork, we are our own thing. Although we’re using the ClickHouse engine, our SQL dialect is completely different. If you try to use a ClickHouse function in Firebolt, you’re not gonna have any luck doing that.
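
To illustrate the dialect split, ClickHouse ships aggregate functions of its own that a diverged dialect would not accept; a small hypothetical example (table and column names are illustrative):

```sql
-- ClickHouse dialect: uniq() is ClickHouse's approximate count-distinct.
SELECT uniq(user_id) FROM events;

-- Portable SQL that a separate dialect like Firebolt's would expect instead.
SELECT COUNT(DISTINCT user_id) FROM events;
```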

Kostas Pardalis 10:40
So ClickHouse, for example, has this concept of table engines, right?

Benjamin Hopp 10:44
Yep.

Kostas Pardalis 10:45
Is this something that can be pre-configured by the user of Firebolt, or is this hidden by you, and part of how you optimize the engine to deliver the performance?

Benjamin Hopp 10:57
There’s no concept of those different table engines in Firebolt, but we do have a concept of a couple of different table types. So we have a fact table and a dimension table. Behind the scenes, what that means is a fact table is sharded across all of the nodes in the cluster, whereas a dimension table is replicated to all the nodes. On top of that, we have a couple of different indexes. So we have what’s called an aggregating index, which is really just a materialized view that is always updated as you ingest more data. You can set your aggregations and your dimensions, very similar to a Druid roll-up, but in Firebolt. And then there are join indexes, which do an in-memory join to really optimize performance. Our goal is to provide an out-of-the-box experience where everybody gets good performance. But if you have specialty use cases where you know ahead of time exactly what aggregations are going to be done, and you’re going to be running those potentially hundreds of times per minute, you can optimize for those specific use cases.
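
A rough sketch of how these concepts surface in DDL, with hypothetical table and column names and approximate Firebolt-style syntax:

```sql
-- Fact table: sharded across all nodes of the engine.
CREATE FACT TABLE page_views (
    view_time TIMESTAMP,
    site_id   BIGINT,
    user_id   BIGINT,
    views     BIGINT
) PRIMARY INDEX site_id, view_time;

-- Dimension table: replicated to every node.
CREATE DIMENSION TABLE sites (
    site_id   BIGINT,
    site_name TEXT
);

-- Aggregating index: effectively an always-up-to-date materialized roll-up.
CREATE AGGREGATING INDEX page_views_agg
ON page_views (site_id, SUM(views), COUNT(*));

-- Join index: keeps the dimension join columns in memory to speed up joins.
CREATE JOIN INDEX sites_join_idx
ON sites (site_id, site_name);
```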

Kostas Pardalis 12:01
That’s very interesting. It’s very smart the way these features are productized, the way you create a product experience on top of something that is technically a materialized view, or how a table is going to be distributed, or a roll-up. That’s where I find this really interesting, because that’s exactly how a product should work, right? You hear what the customer needs, and you turn it into a product experience; nobody needs to know what is happening behind the scenes. So that’s great. So, you have worked with ClickHouse. What’s your favorite part of ClickHouse?

Benjamin Hopp 12:52
I am a big fan of the aggregating indexes, because I come from an old-school world of databasing, where you created summary tables; I used SQL Server Analysis Services to build data cubes and summarize data. Being able to get that same effect of pre-computing all of your aggregations, but not having to wait for a nightly refresh, and being able to build those on the fly, I think is really cool. And then the automatic query rewriting: as your users are writing queries, or your BI tools are writing queries, it’s going to automatically use the aggregating indexes that are available. And you can have multiple aggregating indexes, and the query planner will automatically choose the best one for the query. Again, going back to my history: Druid had a concept of roll-ups, so as you ingest data, it’ll aggregate it to a certain granularity. The aggregating indexes allow you to do that same thing, except you can aggregate to multiple different granularities, and you can aggregate on a field that isn’t time-based, like you need in Druid. So it’s a lot more flexible, but at the same time it remains user-friendly, as opposed to rolling your own materialized view in another system.
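
The automatic rewriting Ben describes means the user’s query doesn’t change. Assuming the hypothetical aggregating index sketched above, a query like this could be answered from the pre-aggregated data rather than a scan of the raw table:

```sql
-- Written against the raw table; the planner can serve it
-- from a matching aggregating index instead of a full scan.
SELECT site_id, SUM(views) AS total_views
FROM page_views
GROUP BY site_id;
```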

Kostas Pardalis 14:17
And what’s your favorite Firebolt addition?

Benjamin Hopp 14:21
The query planner. Firebolt has its own query planner to optimize queries. ClickHouse has no real query planner; it does exactly what you put in. So when you actually release your product to the world, and people write queries, some of them are not optimized; sometimes they’re doing massive joins, and there’s no push-down or anything like that. So having the query planner automatically do those optimizations, use materialized views, use the join indexes, all of that, that’s a huge benefit over using just raw ClickHouse.

Kostas Pardalis 14:57
How was the experience of building this optimizer? I’m asking because I know it’s one of the toughest problems in database systems, and probably one of those problems that can never be fully solved; it’s one of the most discussed problems in computer science. So how was your experience building an optimizer?

Benjamin Hopp 15:22
Well, you might have to talk to people that are slightly smarter than me, because I didn’t build the optimizer. But I think that it’s an ongoing process; we’re always encountering new problems and finding new ways to optimize. Frequently we get data from Tableau or Looker, and it’s generating queries, and we have to kind of understand what it’s trying to do, and then see if there’s a way to do it better. For our solution architecture team, one of their core responsibilities is to take the SQL code that customers are generating and find ways to optimize it. And then we provide that information back to our product and engineering teams, so that they can build those optimizations back into the product and ultimately make it more user-friendly.

Kostas Pardalis 16:09
Makes sense. Yeah, that kind of feedback loop, where the customer experience ends up driving something as deeply technical as an optimizer, I think that’s one of the most interesting things that both engineering and product teams get to experience when working on a product like Firebolt. I find it really, really fascinating. So, that’s super interesting.

Eric Dodds 16:37
I’m interested to know, where does Firebolt fit into architecture? You mentioned that you want it to be sort of a fully-featured cloud data warehouse, or that’s what it is, which actually sounds different than maybe some of the language that we hear from Snowflake, of, like, a data platform, right, that sort of includes a cloud data warehouse but also has this constellation of other tooling around it. So when companies implement Firebolt, what I’d love to know is, what are the types of companies that are adopting it? And then how do they fit it into their architecture? Is it a replacement for, sort of, a Snowflake or Redshift or whatever? So yeah, just tell us how companies are fitting it into their data stack.

Benjamin Hopp 17:24
We want to be a full-featured cloud data warehouse. I’ll be the first to admit we’re not even there yet. But we’re a data warehouse for some very specific use cases. Frequently, we have customers that are coming from Redshift and Snowflake and different data warehouses, and they continue using those products and feed data into Firebolt. Firebolt lacks a lot of the kind of ancillary functionality, a lot of the large-scale data processing capabilities of something like Snowflake, whereas we’re built for a write-once, read-a-whole-bunch-of-times architecture. We don’t have row-level updates and deletes right now. So if you need to make an update to a record, you need to drop a partition of data, which isn’t that unusual in the traditional OLAP world, but people have gotten so used to Snowflake allowing things like that that for some use cases it’s just required. But for those other use cases, where they are doing analytics on immutable data, or not frequently changing data, they can kind of peel off use cases and use Firebolt for those. We built our business model to make that very easy. It’s all pay-as-you-go, consumption-based; you don’t have to sign a contract or anything. So as Firebolt grows and encompasses more and more features, you can grow the use case and move more and more over. So we want to be very cost-effective for the use cases that we’re really good at, and then grow into the rest.
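
For the update-by-dropping-a-partition pattern Ben mentions, the classic OLAP workflow looks roughly like this, assuming a table partitioned by day (table names, partitioning scheme, and syntax are illustrative):

```sql
-- Drop the affected day's partition, then reload it from the source.
ALTER TABLE page_views DROP PARTITION '2022-03-01';

INSERT INTO page_views
SELECT view_time, site_id, user_id, views
FROM ex_page_views
WHERE CAST(view_time AS DATE) = '2022-03-01';
```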

Eric Dodds 19:03
Sure. Makes total sense. Do you have sort of a particular type of company or even industry that tends to adopt Firebolt because of the use cases?

Benjamin Hopp 19:13
We oftentimes see more cloud-native organizations, smaller companies that are comfortable with a SaaS data warehouse, that are comfortable with the data leaving their walled garden, their VPC. And we also see companies that usually have large datasets. So ad tech data, gaming data, clickstream data, marketing data, all of these sorts of things that have huge volumes of immutable data are really a natural fit for Firebolt.

Eric Dodds 19:45
Yeah, super interesting. Okay, personal question: you’ve worked in and seen firsthand a lot of data technologies, and you’re still working in data. What do you love most about it, just on a personal level? Or do you? Maybe, like some of us, you get really deep into a career and it’s like, no going back.

Benjamin Hopp 20:08
I see huge potential in data. I think that, of everything in technology, data is the thing that always provides value. I mean, knowing different programming languages and being able to work with data is immensely valuable. But data itself is something that is always going to be growing; it’s always going to be around. So I think that there are unlimited opportunities for working in data.

Eric Dodds 20:40
Yeah, for sure. On the flip side, what do you like least about working with data? Or working in the data space?

Benjamin Hopp 20:50
The thing I like least is anybody that positions themselves as the answer to every question. If you are a system that is really good at doing massive data processing tasks, like Spark, chances are very good that you’re not going to be great at doing very fast key-value lookups, for instance. So there are oftentimes use cases that are a good fit for a tool and use cases that are not a good fit for the tool, and understanding where those fit is very important. But having one product that says it serves every use case, I think, is just unreasonable. And, you know, I don’t want to…

Eric Dodds 21:43
I was literally gonna say, marketing is the worst part about working in data. And I’m working in data. But I couldn’t agree with you more.

Benjamin Hopp 21:53
I’m sure Firebolt marketing is probably gonna be listening to this podcast so I wanted to dance around it a little bit, but yeah. Other than Firebolt’s amazing marketing team that I could not love more, yeah.

Eric Dodds 22:07
No, but I actually really appreciate the sort of transparency or honesty around saying, this is what we want to be, and this is what we do excellently now. I think that’s really helpful. And I mean, I appreciate that, and I think our listeners appreciate that too, where you kind of know what you’re getting. Because you’re totally right, it’s like the disenchantment of: you look at the site, you look at the product page, you’re like, this is awesome. But the docs are a little sparse. Like, let me try this. And you’re like, oh, right, I know why the docs are sparse.

Kostas Pardalis 22:42
To be honest, the industry right now is at a stage where it’s moving wildly, and there’s a lot of innovation happening, so things change from day to day. It isn’t just the marketing’s fault that things are presented this way; it’s just that the market itself is still trying to figure these things out.

Benjamin Hopp 23:02
Yep. I guess the other thing that always kind of rubs me the wrong way is people that make statements based on outdated information, saying that whatever technology it is, it is what it was five years ago. I still hear people saying that they don’t want to use Apache Druid because they want to write SQL; like, you’ve been able to use SQL with Apache Druid for almost five years, if not more. So I guess people should always keep reevaluating their preconceived notions about any technology or any company as time goes on.

Eric Dodds 23:39
Yeah, for sure. I think we talked about that with a term like CDC (Change Data Capture). There are companies doing really interesting things with it, right, but it’s not new, right? It’s really old technology, even though there are some new companies that there’s some excitement around, but it’s not like—

Benjamin Hopp 24:00
“I don’t want to do CDC because then I have to put triggers in my database, and it’s gonna add additional overhead.” I’ve been doing this for far too long.

Eric Dodds 24:12
That’s great. Well, this has been so wonderful to have you on the show. Learned a ton about Firebolt. Kostas, any last questions before we sign off?

Kostas Pardalis 24:20
It’s been great. It’s great that we learned more about the core technology. Before we end the show, tell us about something exciting that is coming in the near future.

Benjamin Hopp 24:33
Oh, great question. I think streaming data coming to Firebolt is going to be huge. We are working on building in mutations, so you can do those row-level updates and deletes; I think that’s going to open up a lot of new use cases for Firebolt customers. Our ecosystem team is booming. We’re always adding new partners and new integrations into the system. So anytime we can get another partner and learn more about their product and cross-sell and all that, it always gets me very excited as well.

Kostas Pardalis 25:06
You also have a great team.

Benjamin Hopp 25:09
Yeah. You never know what’s coming out of our marketing department. So that’s always exciting.

Eric Dodds 25:14
You gotta watch the marketers, especially when it comes to data. Ben, thank you so much for taking some time with us on the show.

Benjamin Hopp 25:22
Yeah, thank you for having me.

Eric Dodds 25:24
Here’s one of my takeaways. We could definitely count on one hand the number of times that Hortonworks has come up on the show. Even the name Hortonworks sounds a little bit enterprise-y. I guess it is enterprise-y, actually. But it was just interesting to hear about that. And my guess is that Hortonworks has probably played a bigger role in the data world than I think a lot of the content on the show necessarily gives it credit for. That’s my takeaway. Also, the Hortonworks guy was actually from the East Coast, out of Atlanta, which is interesting as well. So yeah, I don’t know, it’s just interesting to hear him talk a little bit about Hortonworks and kind of the work that he did there. And then of course, the Druid stuff is interesting. But what was your takeaway?

Kostas Pardalis 26:05
I think one of the most interesting parts of the conversation was ClickHouse, and how open source actually fuels, let’s say, the innovation in the space. It creates, let’s say, a new state in the industry where, instead of considering a database just too risky a thing to get to market, we can actually get to a point where we can start building companies and products and iterate fast without the risk of the past. That’s something that happened a lot with SaaS in the past decade, for example, but it was harder to replicate with data-related infrastructure, and it seems that open source enables it. And it’s not just ClickHouse, but I think in this case it’s a very good example of how they took the core part, forked the product, built their own query optimizer on top of it, and changed the query parser. I mean, they’ve done a ton of work. But this work was done on top of a very solid core, which let them accelerate the whole process of taking this product to the market, and this is needed. So I’m very excited about that, and I’m waiting to see what other products do something similar; there are other examples in the database space built on open-source cores. Anyway, that was probably one of the most interesting parts of the conversation and what really made me excited.

Eric Dodds 27:37
All right, great conversation. Several more shows coming for you from Data Council, so subscribe if you haven’t, and we’ll catch you on the next one.

We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.