This week on The Data Stack Show, Eric and Kostas chat with Dhruba Borthakur, the Co-founder and CTO at Rockset. During the episode, Dhruba shares his journey working in the early days of the Facebook team, working and developing platforms like Hadoop and RocksDB, the next evolution of hardware storage, value storage and indexes, and more.
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Eric Dodds 00:03
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.
Kostas, this is amazing to me, the people that we get to have on the show. We’re going to talk with Dhruba from Rockset, which is built on RocksDB, which is sort of a legendary piece of technology used at some gigantic companies to drive things like your Facebook or LinkedIn newsfeed. Just incredible. And he has started something new, which is amazing after how much he’s already built. I want to ask him about the early days at Facebook. And this may sound funny, but we haven’t talked in depth about Hadoop, and he’s sort of a Hadoop master, you know, building a lot of things to solve a lot of the problems that Hadoop had. We haven’t dug into it a ton on the show, so I’m going to ask him about that history, and especially that transition at Facebook, because I think that would be very informative. How about you?
Kostas Pardalis 01:18
I mean, first of all, I think we’d definitely like to talk a little about RocksDB and Rockset. So we will spend quite some time talking about that: what RocksDB is and what made it so, let’s say, important to the industry out there. And also talk about Rockset, what Rockset is doing differently compared to other vendors out there, and how RocksDB is part of that solution. So I think we will talk about the whole, I don’t know, probably 20 years of history, from Hadoop to today. So let’s go chat with him. I’d love to do it. Amazing.
Eric Dodds 02:14
Let’s dig in. Dhruba, welcome to The Data Stack Show. We are so excited to learn from all of your experience and, of course, hear about what you’re doing with Rockset. So give us your background, how you got into data, and what you’re doing today with Rockset.
Dhruba Borthakur 02:33
Thanks, Eric and Kostas, for inviting me here. I really appreciate your time and am delighted to be chatting with you today. So yeah, my name is Dhruba. I am the co-founder and CTO at Rockset. Rockset is a real-time analytical database; it’s a cloud service. I’ve been working at Rockset for the last six-plus years. My experience has mostly been with data systems. Prior to Rockset, I worked at Facebook for around nine or 10 years, building a lot of data platforms. I started off building a lot of Hadoop backend data platforms at Facebook: Hadoop, Hive, and HBase. And then I moved over to working on another open source project called RocksDB, which is a key-value store that we built from scratch, or kind of productionized from scratch, at Facebook. And before Facebook, I was at Yahoo building a lot of the Hadoop file system. I was the project lead of the Apache HDFS project back in 2006 or so. So I’ve been with data systems for probably 20-plus years now. It has been quite the journey, and I’ve seen a lot of different software being used for processing data. So I’m excited to be here and to try to answer some of your questions and maybe share some of my thoughts and opinions.
Eric Dodds 03:57
Oh, we definitely would like that. Let’s start with Hadoop. I think a lot of people who are maybe newer to the data industry, in the last several years, may not have had direct experience with the impact that Hadoop had early on in the world of data. I mean, so many things, right? Data processing; big data was, of course, the buzzword. But you worked on Hadoop at Yahoo and then at Facebook, early on, when Facebook’s employee number was still in the hundreds, I think you mentioned when we were talking before the show. Can you paint a picture of where Facebook was at when you joined, the types of data and the data challenges that they were facing, and why Hadoop made so much sense as a system in 2008 for you to build the things that you wanted to build?
Dhruba Borthakur 05:02
Yeah, that’s a good question, because it’s not just the software; the hardware also had something to do with why Hadoop became so popular. Back around 2006, 2007, and 2008, storage became cheaper and cheaper: hard disks, like gigabyte-size hard disks, got cheaper and cheaper, and then terabyte-size disks. Prior to that, storage was costly, right? So you had SATA technology, SATA disks, and they became very cheap in price per gigabyte. So a company like Facebook could buy these disks. Now the question is: what software shall I use to store data on these disks? This is the reason why I think Hadoop became so popular. I mean, I wrote a lot of the Hadoop file system code, but I’m not saying that it is the world’s best code I have written, right? It’s good code; it does its job. But the real challenge for us, the promise of this system, is this: I have, at that time, maybe 100 terabytes of storage. What software should I use to store data? Can I use Oracle? No, I cannot, because I would run out of money; it’s very costly. So that’s the reason why Hadoop became really popular. And I really loved working there, because the datasets at that time were maybe tens of terabytes to a few hundred terabytes, and that was a very big dataset. But within a year or so it became a petabyte, and then multiple petabytes very easily. The challenge, again, was fault tolerance and recovery: can these be handled automatically by software? Because these disks are cheap, they could fail anytime. It’s not high-quality hardware that you have; you have low-quality hardware, but so much of it that you need software to manage these things. So fault tolerance was super important.
But again, Facebook invested a lot of resources here, because very early on, I think, the focus was that if we ever want to monetize a social platform, then we’d have to deal with the datasets of how users are behaving and reacting, those kinds of things. So in 2008, it was mostly about growth for Facebook: how can I use the data to make things more engaging for customers? But around 2011 or so is when Hadoop was used to monetize the platform: how can I show better advertisements? So again, the same technology, but used for very different use cases over the company’s lifetime. But those systems are very batch systems, right? Hadoop is a very batch system. You could just put a lot of data in it, and then you could look for intelligence; you could mine for the nuggets of information that you need to make, let’s say, your advertisements better. So that is how Hadoop became kind of, I’d say, the granddaddy of all these data systems. It made it very easy for people to store data. It didn’t make it easy to query and make sense out of it, but for storing data it was the easiest platform there.
Eric Dodds 08:17
So can you explain how, in your time at Facebook, because you were there for some time, the teams changed around the technology? And was there a huge migration away from Hadoop? Actually, I imagine they are still running some of that old infrastructure. Or do you know?
Dhruba Borthakur 08:39
No, I know. I think that they continue to run some similar software, like homegrown software, because they wanted to improve their backend. But I did see a very clear, what should I say, path of how the data systems evolved. Back in the day, in 2008, it was Hadoop, storing a lot of data. A few analysts would come and make some queries, the queries would give answers in three hours, and then they would rerun some queries. The iteration cycles were probably a week or two before you could actually get the intelligence out and feed it back into your product. But then, after the company became public, I think around 2011 or 2012, there was a lot of focus on monetization, and the monetization focus was: how can you make the systems more and more real time? Because if a user logs into Facebook, you need to show him the right advertisement at the right time, based on what he looked at, where he is currently located, what his geo position is, and you have to do kind of complex...
Eric Dodds 09:49
And mobile had skyrocketed, because in 2008 it was still pretty early. You know, the iPhone was very young, and then mobile, you know, skyrocketed.
Dhruba Borthakur 09:55
Yeah, I mean, surprisingly, Facebook didn’t have a good mobile product until 2012; it got into mobile far later in the game. But yeah, real time became super important, and that’s the time when I actually started to work on another project called RocksDB. So this was a natural progression of events. Hadoop was good, but we could not make it real time; it is a very batch system. There were two main use cases that started on Hadoop but needed to be more real time. One is obviously ad placement and ad showing. The other one is spam detection: if somebody is posting a bad URL to Facebook, we need to quarantine it as quickly as possible. Otherwise there are all kinds of problems, right? Financial issues, everything. So we need to quarantine these bad posts immediately. Hadoop cannot really keep up with these kinds of workloads, where you need to react quickly to new things that are happening in your datasets. So this is when I got chartered into writing something called RocksDB. It’s a database; again, it’s a key-value store, but it is basically for low-latency queries on large datasets. And there was a hardware change that happened again: this time, 2012 is when SSDs became really cheap. Before that, flash drives were costly things. You’d think twice before you bought flash drives; people mostly stored data on hard disks. Towards 2013, SSD prices just started to fall through the floor. So the way we built RocksDB is that we leverage the power of SSD devices and build a database from scratch, so you can get low-latency queries on these large datasets. That was the natural progression there. And then, much later, it was all about building more reactive systems. The systems that I told you about before were built very much passively.
But then we had more reactive systems. It went more to a data flow kind of model, where you make a change here, it produces events, and those go and affect other systems on the side. So the data platforms evolved over time; those are more proprietary to Facebook and not open source software. But yeah, Flink is some of the open source software that we are familiar with, and we built some things similar to it over there, where things became more and more reactive. So I could see a real change roughly every five years, I think, with the systems evolving and taking on a completely different skin, although the cores probably might still be similar in nature. Yeah.
Eric Dodds 12:49
So I have a couple of questions, but the first question is: how far did you push Hadoop? And how did you know when it was time to explore a new solution?
Dhruba Borthakur 13:02
I think it is all driven by the market, right? Like, what are our developers demanding? At Facebook, I mean, I was a backend data engineer writing data infrastructure code, but then there are a lot of people who are writing applications. Take, for example, the Facebook app. When you fire up the Facebook app on your phone and you see all your feeds and posts, that is a data app. And that’s one of the world’s first data apps that needed to process so much data and give you results very quickly. We also saw that the real-timeliness of the Facebook feed is important: if you see your friend’s post immediately, you have better engagement versus if you see it after 15 minutes. So this is a market-driven thing. So then we said, okay, we need to build better data systems which are more real time; if somebody comments on a photo, then that photo should somehow be highly ranked in your feed. So at Facebook we built something called News Feed, which is essentially the backend that powers the Facebook app, and that uses RocksDB. Now, again, Hadoop we just cannot use for those kinds of low latencies that are user facing. So most of these are, I think, application driven. When applications start using datasets, the demands are different versus business analysts using datasets. With Hadoop, mostly business analysts or data scientists use those systems to answer what-if kinds of questions. But when applications started to use data, and one of the first applications was the Facebook app, the demands were very different: it cannot be batch, and it cannot be stale data; it has to be live data. So all those things are driven by applications needing to do more intelligent things with data versus just doing offline analytics.
Eric Dodds 14:57
Sure. Okay, so when you started to work on Hadoop early in 2008, with a couple hundred people at the company, you were trying to drive growth and understand the dynamics of the social graph, etc. Fast forward a couple of years, and you now have an app that has real-time demands. How did you decide to build RocksDB? And what was available to you at the time? I know Facebook builds a lot of stuff, and it’s a very engineering-forward culture, especially historically. But can you describe that process and who ultimately said, okay, we’re going to build a low-latency key-value store type of query engine?
Dhruba Borthakur 15:43
Oh, yeah. So I remember that there were certain step functions in this process. In the earlier part of the real-time journey, Facebook used something called HBase. I think you guys might have heard about it; it’s basically a database built on top of Hadoop. So we tried to use HBase; it was the data system that was powering some of our ads backend. And Facebook also had a very cutting-edge engineering culture at the time, of trying maybe 50 experiments and letting most of them fail, as long as at least five would succeed.
Eric Dodds 16:23
Yeah, I remember that. Well, reading all the, you know, the blog posts about this. Yep.
Dhruba Borthakur 16:28
Yeah. So I worked closely with some of the upper management to say, hey, shall we build something which is actually better than HBase? I got maybe around eight or nine months of time. I looked at other solutions that were out there, and Facebook also had some existing solutions, but I figured that these didn’t really leverage the capabilities of a flash storage system. So I got kind of a charter saying, hey, can I build something that gives low-latency queries on flash drives? I built something, and I had one more engineer with me at first. And then, after only eight or nine months, we could replace a 500-node HBase cluster with a 32-node RocksDB cluster. When you could do this, other teams at Facebook figured, oh, this is technology that is disruptive; this is not a 30% better technology. You see what I’m saying? A 500-node HBase cluster getting replaced by a 32-node RocksDB cluster. Like, massive, orders of magnitude. Yeah, orders of magnitude, right? This is what I mean by a step function. When there is a step function like this and we can show it to developers, they believe it. Everybody is very up to date and understands technology well, so they say, oh yeah, this is very different from the previous generation of software. Then immediately there were like five other use cases, and then more people on the team. But the first six to nine months, or maybe close to a year, is when you are mostly working on a belief, saying that, okay, if I build this by leveraging these things, it will be far better. And after that, I think, it was kind of self-driving: good engineers join the team because they think, oh, this is great technology, and because they’re excited. It’s like a self-fulfilling prophecy after a while. But going from zero to one is the hardest, I think.
Eric Dodds 18:30
In building RocksDB, there were sort of two dynamics driving it: the limitations of HBase, and then the ability to dream about what you could do without limitations. How did you balance those two drivers? Were you more focused on what was possible, or were you more focused on overcoming limitations?
Dhruba Borthakur 18:58
When you look at HBase, I think it was a great product for its time. But when your surroundings change, I think you need a different kind of product that evolves along with your surroundings. This is what I mean by the hardware changing. HBase was built for hard disks, where what you optimize for is seek times on the disk; 10 milliseconds is your seek time on a disk. Whereas on an SSD there’s no seek time; everything is random access, and I get microsecond latencies from the flash drive. So I tried to understand the limitations of the older system, and I tried to look at what the new hardware offers me and how I can leverage it to build higher-layer applications up the stack. I think it’s both sides of your question, essentially. It’s about overcoming limitations as well as dreaming, saying that if I can overcome these limitations because the hardware is helping me, I can actually enable all these types of applications that were not previously possible. I think Hadoop really made big data storage possible; before that it was not possible at all. And then RocksDB and other low-latency query engines let you access data fast from SSD-based devices and enable all these applications that are out there nowadays. I can give examples of some data applications also, if you’d like. Yeah, that’d be great. I mean,
Eric Dodds 20:40
who’s running RocksDB, or running, you know, their applications on RocksDB?
Dhruba Borthakur 20:47
So RocksDB is a key-value store, a very high-performance, low-latency C++ backend. For Hadoop and HBase, I wrote a lot of Java code, and those are good systems as well. But when I tried to focus on performance and low latency, I went with C++ and built RocksDB. So Facebook News Feed, which is what you see in your Facebook app every time you get a news update, is served from a RocksDB-based backend. Similarly, a lot of data platforms inside Facebook, which also deal with lots of analytics, are RocksDB based. Open source wise, I think Kafka Streams uses RocksDB internally, and Flink uses RocksDB internally. The LinkedIn feed, I think, also uses RocksDB internally. Again, it is from blog posts that I have learned these things; it’s not like I have proprietary information there. But there’s a whole bunch of companies now that use RocksDB inside their own software. And, of course, at Rockset, which is where I currently work, we use RocksDB a lot, because we do data analytics focused on real-time analytics, and RocksDB is kind of our building block. We have something called RocksDB-Cloud; it’s an open source project as well. It’s a sub-project of Rockset, and it lets you run RocksDB well on a cloud platform. The reason we do that is because Rockset is a purely cloud-based service, and all our data we store using RocksDB, because RocksDB is essentially a very powerful indexing engine. So we use RocksDB as an indexing engine for all these analytical datasets, and that’s the reason why we can serve low-latency queries. I mean, that’s one of the reasons, not everything. But one of the reasons is that, yes, we can fetch data from large datasets quickly enough using the RocksDB indexing engine.
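To make the idea of using an ordered key-value store as an indexing engine a little more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: `OrderedKV` is a toy in-memory stand-in and not the RocksDB API, and the `doc/...` and `idx/field/value/doc_id` key layout is a hypothetical encoding chosen for this example, not Rockset’s actual format.

```python
import bisect
import json


class OrderedKV:
    """Toy in-memory ordered key-value store (a stand-in, NOT the RocksDB API)."""

    def __init__(self):
        self._keys = []  # all keys, kept in sorted order
        self._data = {}  # key -> value

    def put(self, key: bytes, value: bytes) -> None:
        if key not in self._data:
            bisect.insort(self._keys, key)
        self._data[key] = value

    def get(self, key: bytes):
        return self._data.get(key)

    def prefix_scan(self, prefix: bytes):
        """Yield (key, value) pairs whose key starts with prefix, in key order."""
        i = bisect.bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            yield self._keys[i], self._data[self._keys[i]]
            i += 1


def index_document(kv, doc_id, doc):
    """Store the document itself plus one index entry per field.

    Hypothetical key layout; field values containing "/" would need escaping.
    """
    kv.put(f"doc/{doc_id}".encode(), json.dumps(doc).encode())
    for field, value in doc.items():
        kv.put(f"idx/{field}/{value}/{doc_id}".encode(), b"")


def lookup(kv, field, value):
    """Find matching document ids by reading only one small key range."""
    prefix = f"idx/{field}/{value}/".encode()
    return [key.decode().rsplit("/", 1)[1] for key, _ in kv.prefix_scan(prefix)]


kv = OrderedKV()
index_document(kv, "u1", {"city": "Paris", "plan": "free"})
index_document(kv, "u2", {"city": "Athens", "plan": "pro"})
index_document(kv, "u3", {"city": "Paris", "plan": "pro"})

print(lookup(kv, "city", "Paris"))  # ['u1', 'u3']
print(lookup(kv, "plan", "pro"))    # ['u2', 'u3']
```

Because the keys are sorted, every entry for a given field and value sits in one contiguous range, so a lookup touches only the matching entries rather than every stored document.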
Eric Dodds 22:48
Makes total sense. Two more questions, because I’ve been monopolizing the microphone, and I know Kostas has a bunch of questions. One is technical, and one is about your time at Facebook. The technical question is: what was the most difficult challenge you faced in building RocksDB?
Dhruba Borthakur 23:09
Good question. I think when you’re building infrastructure, data infrastructure or any kind of infrastructure, the question is always about price performance. It’s not just about performance. Think about it like the water pipes in your building. They have to sustain some pressure, and they have to be cost efficient, because you don’t want to spend too much money on a lot of these pipes. But without water, the building is not functioning. It’s the same thing with a lot of this infrastructure: I think it is price performance that matters, not just functionality and features. Again, I was at Facebook, and there we are talking about scale. Building infrastructure the first time is easy; building it up to scale is the challenging part. So the biggest challenge is how we can make it efficient and cost effective and extract everything we can, to make sure that you can get low-latency queries and a huge number of QPS, and make sure that the hardware you’re running on is the cheapest hardware. So those are the price-performance challenges. I would say that measuring performance, benchmarking, iterating, and making sure that it does power real-time analytics, those kinds of things were the challenges. It’s not one thing, it’s a series of things, but it’s all focused on performance. So yeah, performance is the key differentiator for RocksDB compared to every other key-value store or database that is out there.
Eric Dodds 24:48
Yep. Love it. Okay, last question. This is about your time at Facebook. Do you have any fun stories about interactions with Mark Zuckerberg? Because you were there when it was 200 or so employees, so you had to be in a meeting with him at some point if you were working on core data infrastructure.
Dhruba Borthakur 25:07
Yeah, I mean, obviously, the first time was the interview session, because he was a person who was very hands-on, right? He knew everything; I mean, at that time, at least, he knew everything. But over time, the part I really liked is this: in my mind, there are very few people that I know of who have a great sense of both technology and product. Steve Jobs is obviously one; I’ve read about him. But this is another person that I’ve seen from up close. I think there are very few people like that, who have great interest in and understanding of technology but also understand products so well. It’s amazing. Yeah, I mean, there are a lot of fun stories otherwise. At that time Facebook used to be a really small company. We used to be in downtown Palo Alto; there were like seven buildings. I was below a Quiznos, in a sort of dungeon, and Quiznos was a sandwich place, so you get a
Eric Dodds 26:12
nice smell. Smell the food cooking next door.
Dhruba Borthakur 26:15
Exactly. Yeah. A lot of fun. Yeah, so very cool.
Eric Dodds 26:23
I’m sure it was just so inspiring to work with someone like Mark Zuckerberg. Okay, Kostas, please jump in here. I could keep going, but please jump in. Yeah, it’s like,
Kostas Pardalis 26:34
it’s such an interesting conversation; I really enjoy listening to you talking. Okay, I have a question, though. I find it super interesting that there’s a pattern in what you were saying so far, Dhruba. We had one storage revolution, or evolution, and a new software technology, Hadoop, came out of it. Then we had the next one: we went from SATA to SSDs, and we ended up having RocksDB and this whole family of systems that take advantage of this new storage. So what’s next? What do you expect? Do you have a prediction about what’s going to be the next evolution in hardware, in storage, that will drive yet another evolution like that?
Dhruba Borthakur 27:33
It’s awesome that you’re asking me this, because I think this is the reason why I started Rockset. What happened is that, at least in 2015 and 2016, I really saw that the cloud was becoming really popular. The cloud, I think, is a different piece of hardware now. The reason the cloud is a different piece of hardware is that you can provision new hardware by using a software API and order things. Give me 1,000 machines: there’s a software API to get 1,000 machines. In the old times, you’d have to provision, you’d have to set up racks and build out a data center. So the reason I’m really excited about this new phase that I’m working on is that the cloud has become really popular, and the cloud is the third type of hardware change that I have seen in my lifetime: first SATA disks, then SSDs, and now the cloud. What is different in the cloud is that you can provision hardware by using a software API. You could get machines, you could get CPUs, you could get storage, whatever else. So this is kind of the vision for Rockset: how can we build a cloud database? The primary reason for Rockset is price performance. Again, price performance is my key for every piece of software that I’m trying to build. The reason Rockset is best on price performance is that it’s built natively for the cloud. It’s not something that you download and install in your data centers on your machines. Take, for example, this database compared to other databases: we have complete separation of storage and compute. In many databases, storage and compute are together; Rockset can separate these two out. That is great for applications, because if you have a lot of data, you need more storage, and if you have a lot of queries, you need more compute. But doing that without being slow is the problem, right?
Many other systems are out there where, if you disaggregate, your queries are slow. The key for us is: how can we build a disaggregated system where the queries are faster than existing systems that are out there? So that’s one. The second change I see is that almost everybody is moving from stale analytics to real-time analytics. If you look at EMR, AWS EMR, or even Snowflake, they’re all about stale analytics: can I get data from 15 minutes ago and run some analytics on it? Whereas for Rockset, it’s all about real time: how can I look at data that just got produced a few seconds or a few minutes earlier and take action on it? And it’s not people who are taking the action; it’s other software that is taking action on the data. So yeah, I think those are the two trends that I see: the hardware change with the cloud, and the market being just ripe for evolving their systems to produce new features or new facilities for applications. So would you say that, like Rockset is,
Kostas Pardalis 30:42
let’s say, a piece of infrastructure that someone would use to build other software? Or is it closer, let’s say, to a data warehouse, where people go and use it to do, even in real time, still ad hoc analytics or reporting? How do you position the product itself in this landscape, which, to be honest, is pretty crazy, right? There are so many things happening. So that’s why I’m asking.
Dhruba Borthakur 31:15
No, great question. So what happens is, this is again the trend that I see. With Hadoop and other systems, analysts or quants look at the data; it’s more manual analysis. But I think what is happening now is that it is software that is using this data. I’ll give you examples. We have a use case where the largest micro-financing payment system in Europe is using Rockset. They’re getting events from all the transactions that they’re doing, and they want to quickly figure out which events or which payments are fraud or scams and then take action. If they quarantine the transaction within a few seconds rather than much later, this saves a lot of money. You see what I’m saying? That’s one example. And again, this is an application that is running; no analysts are sitting and looking at graphs on Rockset. Another one: we have a big airline that is using Rockset. When you buy an airline ticket, the price of the ticket is different on different days. So they take feedback on demand and supply to figure out real-time ticket pricing, and they use Rockset for this. Again, no people are querying this data. Yes, there are travel agents or whoever buying tickets at the end of the day, but the backend is the one that uses querying systems like Rockset to figure out what the current price of a ticket is when somebody is flying from one place to another. It’s all about automated systems making queries on datasets; it is not about people manually doing ad hoc queries and figuring things out. Those are also possible on Rockset, because Rockset has a SQL interface. So what Rockset is, it’s a RocksDB-based database, but we have a SQL interface on it. So you can do standard SQL using joins, aggregations, group by, sorting, everything else.
So people find it easy to use. Because RocksDB is a C++ backend, right? Not every data person or developer can write C++ code, or should have to. So we have a very standard SQL over REST API. It’s easy to use, but you get the power of RocksDB on the backend. So it is kind of a marriage of both sides.
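As a rough illustration of the kind of standard SQL described here (a join, an aggregation, a group by, and a sort), the following sketch runs a fraud-style query with Python’s built-in `sqlite3` module. SQLite is used purely as a local stand-in so the query can run anywhere; it is not Rockset, and the `accounts` and `payments` tables, their columns, and the sample rows are all made up for the example.

```python
import sqlite3

# In-memory SQLite database as a stand-in SQL engine (an assumption, not Rockset).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE accounts (account_id TEXT PRIMARY KEY, country TEXT);
    CREATE TABLE payments (
        payment_id INTEGER PRIMARY KEY,
        account_id TEXT,
        amount     REAL,
        flagged    INTEGER  -- 1 if an upstream check marked the payment suspicious
    );
    INSERT INTO accounts VALUES ('a1', 'DE'), ('a2', 'FR'), ('a3', 'DE');
    INSERT INTO payments (account_id, amount, flagged) VALUES
        ('a1', 20.0, 0), ('a1', 900.0, 1), ('a2', 300.0, 1),
        ('a3', 700.0, 1), ('a3', 5.0, 0);
    """
)

# Join + aggregation + group by + sorting, all in standard SQL.
QUERY = """
SELECT a.country,
       COUNT(*)      AS flagged_count,
       SUM(p.amount) AS flagged_amount
FROM payments AS p
JOIN accounts AS a ON a.account_id = p.account_id
WHERE p.flagged = 1
GROUP BY a.country
ORDER BY flagged_amount DESC
"""

rows = conn.execute(QUERY).fetchall()
for country, flagged_count, flagged_amount in rows:
    print(country, flagged_count, flagged_amount)
```

In the scenario described in the conversation, a query like this would be issued by an application on every event rather than typed by an analyst, which is why query latency and sustained QPS matter so much.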
Kostas Pardalis 33:38
So I have a question about where the use cases come from, and some follow-up questions about RocksDB too. Okay, let’s consider some other real-time data processing systems, like Druid or ClickHouse. What’s the other one? Pinot. These are the ones that come to my mind right now. How is Rockset different from these systems?
Dhruba Borthakur 34:10
Good question. Yeah. So what happens is that Druid, I think the Druid project started probably in 2008 or nine. I mean, it’s been around
Kostas Pardalis 34:19
for a while. Yep. Yeah.
Dhruba Borthakur 34:21
So what happened is that for Druid and Pinot, or even Snowflake for that matter, right, they all leverage columnar organization of your data, which means that if your record has 50 fields, they store every field in its own column, and then the processing is: can I scan this column as quickly as possible? Right. So all of those systems are column-based systems. And the queries are all scan-based, which means: how can I parallelize my query and scan every column as quickly as possible? So that works when there are ad hoc queries, right? But now when your QPS increases... let’s say an analyst is making a query; I mean, he can probably make a query once every 10 seconds or whatever. But when software is making queries, the QPS of the system is high. So let’s say there are five QPS, 10 QPS, hundreds of QPS. So just imagine the amount of compute you need to keep scanning this data set again and again for every query. It’s like looking for a word in a book: if you read through a 500-page book, it takes you so much energy and time to find the string you’re looking for, versus going to the end of the book, looking up the index, and saying, oh, this is the string I’m looking for. So Rockset is built with an indexing technology and not a scan-based technology, okay? It means that when a query comes in, we don’t need to go scan all the data again and again for every query; we leverage the index very efficiently to figure out where the data that matches the query exists, and return it. That is basically the difference between Rockset and all other systems, which includes Druid, Snowflake, Pinot, everybody else. Some of these systems are trying to think about building an index now, or how can I make an index manually? For Rockset, every field is indexed; we call it the converged index. So our converged index is the differentiator, why our queries are fast versus the other scan-based approaches that are out there. So that’s one.
And the second one is that we work only on the cloud, versus Pinot and ClickHouse and everything else. These are all pre-cloud software, which basically means that storage and compute are kept together instead of storage being separate from compute, just like Hadoop. Nothing wrong with it; it works well when you have your own data center with your own machines. In the cloud, I think it’s super important to be able to separate compute and storage. That’s the only way for you to scale up and be cost effective. So none of the other systems can give you disaggregation. Rockset can separate query compute and storage. And because it’s a real-time system, it also disaggregates ingest compute. So the three tiers of disaggregation are compute needed for writes, compute needed for queries, and storage needed to store your data. So Rockset is essentially a three-tier disaggregated architecture, which is why you get the best price performance if you use Rockset. Again, you could do similar things on ClickHouse or Snowflake, but the price performance would be much worse compared to what you get if you use Rockset. That’s basically the difference. Am I able to explain it?
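The book-index analogy above can be sketched in a few lines. This is a toy illustration under stated assumptions, not Rockset's converged index: the records and field names are made up, the "index" is a plain in-memory dictionary, and the only point is the difference between touching every row per query and doing a single lookup.

```python
from collections import defaultdict

# Hypothetical event records; the field names are invented for illustration.
records = [
    {"id": 0, "city": "SF", "status": "ok"},
    {"id": 1, "city": "NYC", "status": "fraud"},
    {"id": 2, "city": "SF", "status": "fraud"},
    {"id": 3, "city": "LA", "status": "ok"},
]


def scan_query(rows, field, value):
    """Scan-based: every query reads every row."""
    return [row["id"] for row in rows if row.get(field) == value]


def build_index(rows):
    """Index every field once: (field, value) -> ids of matching rows."""
    index = defaultdict(list)
    for row in rows:
        for field, value in row.items():
            if field != "id":
                index[(field, value)].append(row["id"])
    return index


def index_query(index, field, value):
    """Index-based: one dictionary lookup, no pass over the data."""
    return index.get((field, value), [])


index = build_index(records)
```

Both `scan_query(records, "city", "SF")` and `index_query(index, "city", "SF")` return the same ids, but the scan does work proportional to the data set on every call, while the index pays that cost once at write time. That is the trade that matters when software issues hundreds of queries per second.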
Kostas Pardalis 37:42
100%. Okay, and you mentioned, when you started talking about RocksDB, the first thing that you said is that it’s a key-value store, right? How do we go from a key-value store to that index that you’re talking about? Because you don’t necessarily need to have an index to have a key-value store, right? A key-value store is exactly like a hash table, right? I have a key and I want to go and pull the information based on the key that I have. So tell us a little bit about that. And is this, first of all, part of RocksDB, or is this more part of Rockset?
Dhruba Borthakur 38:18
A great question. So Rockset uses RocksDB to build an index, right? But, I mean, there’s precedent for doing it, and I saw it being done at Facebook as well. So I’ll give you an example, right? At Facebook, we used to use RocksDB for storing the social graph; let’s say this is the username, and these are the post IDs, right? Social graphs. But then very quickly, at Facebook (I’m talking about how we used RocksDB there), we also wanted an index based on geolocation. You see, when you want to go by geolocation, say Golden Gate Bridge is the key: who all has posted photos at the Golden Gate Bridge? Now that’s a secondary index on the social graph. The application just uses RocksDB to be the secondary index on that entire social graph. With 20 petabytes of social graph data, we used RocksDB to build a secondary index to be able to serve queries like: show me all my friends who visited this location between these two dates. So this is what kind of inspired us to build Rockset as well. So in Rockset, people have data that we store in RocksDB, but we also use RocksDB to build a secondary index on every field, so that when your queries come in, you can ask arbitrary questions of the data and all queries are fast. Other systems, like ClickHouse, for example, right? If you use ClickHouse, that’s good if you’re making one query, but once you want to make a change in the query and you say, I need these filters, now you’d have to go talk to the ClickHouse database administrator or whoever is managing the database, saying: I’m going to make this query now. Can you create a secondary index on these columns? Can you normalize this data so that my queries can have this additional filter? In Rockset, everything is pre-built for you.
So you can actually make queries on any of this data, on any field, without having to re-ingest the data. And we have made the cost of indexing so cheap that people don’t have to think about the indexing cost anymore. Prior-generation databases have the thinking that indexing is costly, right? But because of cloud-friendly design, the separation of compute and storage, we can build indices really cost-effectively. Using Rockset, you don’t have to think: shall I build an index? Or is it going to be prohibitively costly? So yeah, this is kind of the thinking change in a developer’s mind of why, when, and how to use Rockset.
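One common way to get from a key-value store to a secondary index on every field is to write extra keys alongside each document, so that a lookup becomes a prefix scan over sorted keys. The sketch below assumes that approach; it is not taken from Rockset or Facebook code. `SortedKV` is a hypothetical in-memory stand-in for an ordered store such as RocksDB, and the `doc/...` and `idx/...` key layouts are invented for illustration.

```python
import bisect
import json


class SortedKV:
    """Toy ordered key-value store (a stand-in for an engine like RocksDB)."""

    def __init__(self):
        self._keys = []   # keys kept in sorted order
        self._data = {}   # key -> value

    def put(self, key: str, value: str) -> None:
        if key not in self._data:
            bisect.insort(self._keys, key)
        self._data[key] = value

    def get(self, key: str):
        return self._data.get(key)

    def prefix_scan(self, prefix: str):
        """Yield all keys that start with prefix, in sorted order."""
        start = bisect.bisect_left(self._keys, prefix)
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            yield key


def put_document(kv: SortedKV, doc_id: str, doc: dict) -> None:
    """Store the document, plus one index key per field (hypothetical layout)."""
    kv.put(f"doc/{doc_id}", json.dumps(doc))
    for field, value in doc.items():
        kv.put(f"idx/{field}/{value}/{doc_id}", "")


def find(kv: SortedKV, field: str, value: str) -> list:
    """Answer 'field == value' with a prefix scan over the index keys."""
    prefix = f"idx/{field}/{value}/"
    return [
        json.loads(kv.get("doc/" + key[len(prefix):]))
        for key in kv.prefix_scan(prefix)
    ]


kv = SortedKV()
put_document(kv, "p1", {"user": "ana", "place": "golden_gate"})
put_document(kv, "p2", {"user": "bo", "place": "golden_gate"})
put_document(kv, "p3", {"user": "ana", "place": "alcatraz"})
```

Here `find(kv, "place", "golden_gate")` returns the two posts tagged at that location without anyone having asked an administrator to create an index on `place` first. The sketch ignores details a real system must handle, such as removing stale index keys when a document is updated and escaping values that contain `/`.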
Kostas Pardalis 40:51
That’s super interesting. And okay, I don’t know, maybe it’s a bit of a dumb question. But in my mind, there’s always a trade-off when you’re indexing, in terms of how fast you can write your data, right? As you start adding more processing during the ingestion process, the slower the process is going to be, right? And I think you have systems that are extremely, let’s say, optimized for writes, right? And I think RocksDB is an example of that. You can literally build a data system based on RocksDB that’s going to be extremely fast at writing data, right? But if you start adding indexes there, and you want to keep latencies low, and you also want to have really fast ingestion, and at the same time be able to serve the indexes to your users, how do you do that? How do you balance and make the right trade-offs there? Because at the end, that’s what engineering is, right? Figuring out the right trade-offs. So how do you do that?
Dhruba Borthakur 42:08
Exactly. Yeah, I think you’re absolutely right. Indexing basically means that you need more computation when you write data, because now every byte that you write needs to be indexed, right? So let me explain to you why it is easier or cost effective to do it in RocksDB. So RocksDB is an LSM engine, a log-structured merge-tree, so it is unlike a B-tree, or a binary tree, or whatever else other databases use, like Postgres or MySQL, right? So for prior-generation systems, if you do a write, the database needs to go read a page from storage, and then update it, and then write it back. So there’s a read-modify-write for every write that’s happening. Whereas for an LSM engine like RocksDB, when a new write happens, it all goes to a new place on the disk and doesn’t overwrite stuff. So the write rates are similar to what you see on the disk device. For an SSD device, if the SSD is able to do 500 megabytes per second of writes, RocksDB can keep up with it, as long as you have some compute associated with that storage device. You see what I’m saying? So it’s a very different write rate compared to the binary tree that most databases used earlier. So that’s one thing that Rockset uses a lot, which basically means that we can write at RocksDB speeds. And the other thing is that in Rockset, the way we shard is different from most databases. So if you use HBase, or if you use Cassandra or other database systems, when an update happens and you need to build indices, the update will be here, and the indices end up on different machines. So you need Paxos, Raft, or some other protocol to be able to keep all the machines in sync. Rockset doesn’t do that. What Rockset does is basically document sharding. So a document goes to one machine in the cluster, and all the secondary indices are built on that node. Secondary indices are not spread out among other machines.
So writes don’t need Raft or Paxos or any other such protocol; the writes all go to one machine. These two reasons are why we can sustain high write rates; I’m talking about, say, 500 megabytes per second write rates on data systems that we have, which are constantly indexing and storing data in RocksDB. And you don’t have to over-provision, because in the cloud you don’t have to provision for peak capacity, right? Because we can get machines when needed. So this is why this is now economically feasible for users. You see what I’m saying? In the old times, it was not economically feasible, because I had to provision for peak capacity, and I had to buy all my machines and keep them there for when my highest write rate happens. But in the cloud, that’s not true. Maybe your highest write rate happens from nine to four o’clock in the daytime, and not at other times. Let’s say you are looking at market data or something like that, right? Half of the time your market is not live, like the stock market. So there are many reasons like this why these kinds of indexing technologies are becoming cost effective at scale now.
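The write path described above, where new writes go to a new place instead of modifying existing pages, can be sketched as a toy log-structured merge store. This is a simplified illustration under assumptions, not RocksDB's implementation: the class name `TinyLSM` and its `memtable_limit` parameter are hypothetical, "disk" is just a Python list of immutable sorted runs, and there is no write-ahead log, bloom filter, or leveled compaction.

```python
class TinyLSM:
    """Toy log-structured merge store: writes never modify existing runs."""

    def __init__(self, memtable_limit: int = 2):
        self.memtable = {}   # in-memory buffer for the most recent writes
        self.runs = []       # immutable sorted runs, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # A write only touches memory: no read-modify-write of stored data.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Persist the buffer as a brand-new sorted run (a sequential write).
        if self.memtable:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        # Newest data wins: check the memtable, then runs newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):
            for k, v in run:
                if k == key:
                    return v
        return None

    def compact(self):
        # Merge all runs into one, keeping only the newest value per key.
        merged = {}
        for run in self.runs:   # oldest first, so newer values overwrite
            merged.update(run)
        self.runs = [sorted(merged.items())] if merged else []
```

Overwriting a key, for example `put("a", 1)` followed later by `put("a", 3)`, leaves the old value sitting in an older run until `compact()` merges the runs; reads still see the newest value because they search from newest to oldest. That deferred cleanup is what lets the write path stay sequential, at the cost of extra work for reads and background compaction.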
Kostas Pardalis 45:17
Yeah, it makes a lot of sense. That was super informative. All right. So, what’s next? And when I say what’s next, it’s two questions: what do you see as next for the industry overall? And what is next for Rockset? What’s the next thing that you really anticipate you’re going to deliver in the product?
Dhruba Borthakur 45:44
Yeah. So let me answer your second question first, right, because then I can explain where the industry is going. So Rockset, we are a cloud service. So we’re constantly making improvements to our back end and shipping new products. And the thing that excites me the most is that most data systems, I feel, currently are not great at giving isolation between different applications running on the same data sets. You see, I have had hands-on experience with Hadoop and Kafka and all these other open source technologies. They are good technologies, but when you want to use them for five different applications on the same topics or the same database, it is very difficult to do. So this is something that Rockset is innovating on a lot, where you could have one storage, one database, but you could have five different applications: one is a real-time ticketing application, one is a fraud detection application, one is a marketing application, all running on the same database with completely separate compute engines. But they all see live data that is changing in the database, without the customer having to copy data from one place to another and make sense out of it. So multi-tenancy, and the ability for different apps to leverage these large datasets with the least amount of complexity, is what we’re kind of innovating on, on the Rockset side. As far as the industry is concerned, I think, how should I say it, I’d say that real time is very addictive for most applications. It’s like a real addiction, in my mind. What we have seen is that when customers use a real-time system, they cannot go back to a stale analytics system; they can feel the difference. It’s like tasting sugar for the first time, right? There’s some place I went to the other day; I won’t name the country because I don’t want to say bad things about the country.
But that country did not have sugar till 1876 or so, until the English people went there. So then suddenly, everybody got addicted to sugar. But yeah, real-time analytics is like that. I think most people are used to stale analytics, saying, okay, I’ve got to wait for one hour to figure out what to do next, how to make my business better. But once you taste it, it just sticks with you. And I think a lot of applications, data applications, starting from your food delivery to your book shipping or whatever else, everything is more and more real time. Data is just transforming everything that we have in the world; data is pouring in. I don’t want to sound cliched, but data is the new oil and all this stuff I keep hearing, right? This is coming true, in my mind: a lot of automation is being built on these datasets. And it’s not people making decisions anymore. It’s some other piece of software that’s making decisions based on data. And Rockset and real-time kinds of applications are driving more and more into this area.
Kostas Pardalis 48:55
Yep. 100%. All right, one last question from me, and then I’ll give the microphone back to Eric. So RocksDB is, I don’t know, one of the most successful open source projects out there, right? It’s phenomenal. Not just the use of it, but how much it has been used to build other technologies on top of it, like you mentioned. I mean, I think a testament to this is that if someone goes to GitHub and sees RocksDB there, yeah, you can see the thousands or tens of thousands of stars, but what is so impressive is how many clones exist, how much it has been cloned, right? Which means that people do like working on it. What, in your opinion, outside of, okay, obviously there’s something revolutionary about the technology itself, right, but what else do you think contributed to the success of RocksDB as a project, the adoption and all that stuff, outside of the technology itself?
Dhruba Borthakur 50:02
I think it’s the people and the funding. I think no software becomes useful unless there are good people working on it. And nothing happens in the world nowadays through one or two people; you need a good set of great people to band together to build this software. That’s the first one, I think. And the second one is that you need some kind of support so that the community and the software grow along with it. And I think Facebook provided that a lot, especially testing frameworks, especially leveraging many other systems that were there at Facebook to make RocksDB better, and tooling to be able to figure out and find bugs quickly. So again, I think those are the two things that make a lot of these open source projects successful, right? One is the people: if you can assemble 20 great people to build a project, I think that’ll be a fantastic project. And the next one is: is there a force behind this community so that it can move forward? I can see that happen in many other open source communities as well. I mean, the Hadoop community: I still participate in it, but I don’t write Hadoop code much anymore. But I see that the community is very big there as well. So yeah, I think open source is interesting. You know, you are at RudderStack, so there’s a lot of open source development there as well. And I think the open source communities kind of feed off one another. So it’s a good cycle to participate in and make things better. So yeah, that’s the answer about RocksDB.
Kostas Pardalis 51:33
Yeah, [inaudible]. Eric, all yours.
Eric Dodds 51:36
Yes. Okay. One philosophical question and one practical question, too, to close out the show. The philosophical question: you mentioned, you know, that more and more real time is having an impact, and there are machines making decisions. You’ve seen this on a much closer level, from sort of a, you know, deep-in-the-stack data perspective. Is there anything that worries you? Or maybe a better way to phrase the question is, do you have thoughts on how we steward this technology, you know, as we implement it? Because people have lots of opinions on machines making decisions, and the technology is obviously enabling that. So, thoughts?
Dhruba Borthakur 52:23
I mean, I know. Take, for example, all the drone systems that are out there, right? There is a big backend system of real-time analytics of what the effect of the drone is when you put bombs somewhere and things like that, right? Yeah. So real time is, I think, changing the world. Like we kind of discussed, I think there is a fine balance in how to channel that thing for the greater good. There’s always 10%, or some few percentage, of usage which is probably not the most ideal for humanity, right? It’s just like the atomic bomb, right? It produces a lot of energy. Now I’m bringing in atomic energy, but I think we should be able to leverage it well. I feel like we have lots of bells and whistles; a lot of the automatic processes that we build always have fallback mechanisms, so that they stay within a band and don’t just go haywire and ruin everything. So I think the human mind is still at the top of the food chain, before all this automatic decision making becomes life-threatening or anything. So yeah, for sure.
Eric Dodds 53:36
I think it goes back to what you said about people, right? It’s, you know, the right people behind the technology, I think, that is the most important thing. So thank you for entertaining a somewhat ethical question. Okay, last question, which is more practical. You have seen, and really built on, such a wide swath of data technologies, you know, even reaching back into the days of making architectural decisions based on hardware. And I think a decent proportion of our listeners, you know, won’t ever have to make those decisions, because, you know, Rockset has made those decisions for them in the cloud, right? And so they can just sort of scale without, you know, thinking about it. And now you’re building for the cloud. When you mentor younger people in the data industry, what do you tell them, you know, about how to think about their career and how to think about data? Because you bring such an immense amount of perspective on the history. So how do you package that into advice?
Dhruba Borthakur 54:43
One of the core philosophies, again, I think, is to add value to somebody’s life. I really don’t care whether it is monetizable or something you can make money from, but I feel like if you can add value to somebody else’s life, then automatically, as a side effect, you make the ecosystem better, you probably make things better. So if you’re starting off in a data career, in the early parts of the career, I think the focus should always be: how can I build something, or how can I do something, that adds value to somebody else? That someone else could be peers on your team, right? That’s a great thing as well. It could be customers, if you’re selling stuff. Or it could be just plain users, like with open source software, right? There are no customers; they’re mostly users. So as long as you are focusing on building value, I think you’ll get into this cycle of becoming more impactful yourself and enjoying the work at the same time. So the work doesn’t feel like work; work becomes more enjoyable, because you’re adding value, you see people liking your stuff, and you build more of it. Yeah, I have probably given you a meta answer. I mean, this is applicable to whatever industry you are in; it doesn’t have to be software or anything else, whatever you’re doing. I feel like if you’re adding value to somebody else’s work, or somebody else’s product, or life, I think that is a great place to be, as long as you’re also enjoying the work that you yourself are doing.
Eric Dodds 56:17
Yep. No, I think that’s not only wonderful career advice for people working in data, but just wonderful life advice. So Dhruba, thank you so much for that. And thank you for being so generous with your time. It was amazing to talk with you, the builder of, you know, some of the highest-impact technologies that we see driving a lot of the things we use every day. So thank you so much for giving us some of your time.
Dhruba Borthakur 56:44
Hey, thank you. Thanks a lot, Eric. And it was really good chatting with both of you.
Eric Dodds 56:49
Well, I don’t think many people know that, you know, RocksDB was originally fueled by the smell coming from a Quiznos that was baking subs next to the Facebook office. But now you have the backstory, brought to you exclusively on The Data Stack Show. No, it was great to hear him talk about, you know, Facebook’s sort of first office, and being in the basement. So many interesting things that we covered. I was just so impressed with how Dhruba has maintained, I would say, just a high level of interest and joy in the space and in building things, for someone who has seen so much. I mean, it’s cool to hear the stories, but you have to imagine the day to day of trying to build that stuff and scale that stuff inside of an organization that’s growing like, you know, Facebook, and, you know, being a founder. I mean, those are really intense experiences. And he still seems just, you know, full of joy and energy. And that brought me a lot of joy. So I think that’s the main thing I’m gonna take from the episode. How about you?
Kostas Pardalis 58:06
There are many things, but I’ll keep one for sure, which is how much of the innovation that we have seen in software is actually triggered by innovation in hardware. This is one of the insights that I don’t think you can get from someone unless this person has been in this for a while and doing the stuff that we were talking about today. I think of the discussion about storage, and how storage actually dictates the things that we can do, and how this worked with Hadoop, and how then SSDs brought RocksDB, all that stuff. It just makes me, I don’t know, I think I’ll be looking into what new storage technologies are going to be coming in the next months and years with much more interest now than before. So I’ll keep that. And let’s arrange to have him back soon, we will want to talk about
Eric Dodds 59:26
Well, thanks for listening. Many more great shows are coming up, subscribe if you haven’t, and we will catch you on the next one. We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.