Episode 102:

Building Pinot for Real-Time, Interactive User Analytics with Kishore Gopalakrishna of StarTree

August 31, 2022

This week on The Data Stack Show, Eric and Kostas chat with Kishore Gopalakrishna, the co-founder and CEO of StarTree. During the episode, Kishore discusses internal analytics versus user-facing analytics, new technology, and the data landscape.


Highlights from this week’s conversation include:

  • Kishore’s background and career journey (2:30)
  • Internal analytics versus user-facing analytics (3:49)
  • New ways of thinking about analytics (8:06)
  • What makes Pinot different (13:45)
  • How Pinot transforms systems (21:53)
  • Understanding the data landscape (32:40)
  • The Pinot user experience (36:27)
  • Something exciting about StarTree (40:05)
  • When you should adopt this technology (43:15) 

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.


Eric Dodds 0:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.

Welcome to The Data Stack Show. Kostas, today I'm so stoked about our guest, because it's a subject I don't think we've actually covered before, which is user-facing analytics. Our guest is one of the creators of Apache Pinot, a technology that came out of LinkedIn and really drove a lot of their user-facing analytics. And I'm just excited to chat about that because it's a new subject for us. One of the questions I have is this: we talk so much about analytics in terms of reporting inside of a business — KPIs, developing analytics for internal users, the executive team, marketing, etc. I want to know if there's a big difference in the way you think about user-facing analytics and developing that, and then of course the technology around it, because obviously they developed Pinot to deliver that in the context of LinkedIn. So that's what I want to ask about. How about you?

Yeah, I'm very excited about our show today, because we're touching a subcategory of analytical databases that we haven't talked about in the past: real-time OLAP databases. Pinot is one of them, along with a couple of others, but it's the first time we're going to discuss this category. I'd love to learn more about what makes these systems unique, both from the use case perspective — because what you mentioned is one of the use cases they uniquely serve — and also from the technology perspective: what makes Pinot different from BigQuery or Snowflake, for example. So yeah, I'm really looking forward to our conversation with our guest today.

All right. Well, let’s dive in.

Kishore, welcome to The Data Stack Show. We’re so excited to chat with you.

Kishore Gopalakrishna 2:09
Thank you. Thank you for having me. I’m so excited to be here.

Eric Dodds 2:13
All right. Well, could you just give us a background? Sort of your work history and how you got into data? And then what you’re doing today at StarTree?

Kishore Gopalakrishna 2:24
Yeah, absolutely. Maybe I can go backward and start from what I'm doing today. I'm Kishore. I'm the CEO and co-founder of StarTree. Prior to that I co-authored Apache Pinot at LinkedIn. And before that, I built a lot of distributed systems. That's what got me excited about the world of distributed systems — it's fascinating to see how one node going down can render the other nodes, or the entire system, completely useless. That kept me curious across a lot of distributed systems. Over my career, Espresso was one of the things I built, a document store similar to MongoDB. Then I also built Apache Helix, a cluster management framework used to build other distributed systems. Then ThirdEye, an anomaly detection and root cause analysis platform. And of course Apache Pinot, which ThirdEye is built on top of.

Eric Dodds 3:18
Yeah, amazing — what a resume of work. I almost don't know where to start. But I think one thing that would be helpful is to start by talking about internal analytics versus user-facing analytics, because that's a really important paradigm for Pinot and, of course, StarTree. Could you explain the main differences between internal analytics and user-facing analytics?

Kishore Gopalakrishna 3:50
Yeah, absolutely. Before getting to the low-level details, let me start with just the concept of user-facing analytics. I was really not familiar with this term. But back at LinkedIn, I was building this other system, Espresso, which was user-facing but served the OLTP workload — all the transactional workloads. Then LinkedIn embarked on this new thing: hey, we have all this data that we are collecting from the users, so how can we actually provide insights back to them — back to all the LinkedIn members visiting the LinkedIn website? That's where user-facing analytics really originated. Most of you are familiar with Who Viewed My Profile, where you go to LinkedIn, you see your page, and you see: these are the people who actually viewed your profile, they're from XYZ companies, they have XYZ skill sets, here's the geo. It was a very interesting app that we started off with. And it wasn't built on Pinot at first — a prior version was built on a search engine. That's where everything originated. So that's really the concept of user-facing analytics. From the use case point of view, it's very similar: you have the data that you're collecting already. Internal analytics is about surfacing those insights to your internal employees within the company, be it analysts, operators, even engineers or CEOs. Whereas user-facing analytics is really taking that outside the organization — to your customers, to your partners — and providing them with all the insights so they can make better decisions. The classic example, if you go outside of LinkedIn, is Uber Eats. You can think of providing analytics to a restaurant owner as user-facing analytics.
All the orders happening on Uber Eats are coming into Pinot, and that data is being surfaced in real time to the restaurant owner. Now they know: what's my revenue, what's my wait time, how long am I taking to process this order? So this is the whole thing of providing insights directly to the end user — the one making these micro-decisions: should I bring in another person to help me so the wait time goes down? Because that's directly impacting my business. What used to happen via reports sent once a month or once a day is now being directly surfaced to the end users via interactive apps. That's really the change we saw with user-facing analytics.

Eric Dodds 6:33
Interesting. So yeah, that makes total sense — in terms of the aggregate, like, "hey, here's your weekly report on who viewed your profile," as opposed to logging into the app and instantly seeing who's viewed your profile, maybe since the last time you logged in. One thing I'd love to get your perspective on: are there helpful ways of thinking about internal analytics versus user-facing analytics? I'll give you one example. A lot of times internal analytics are aggregated, right? You want to know monthly active users, you want to know trends on revenue or margin or usage or activity, whatever it is. What's interesting about the example you gave for LinkedIn — I log in, and you instantly deliver "here's who viewed your profile, here's their geo" — is that each user almost gets their own filtered pivot of data from what internally would be a large data set. Something like "how many people on average are viewing these sorts of profiles" is more of an internal KPI for a product manager. But the user-facing analytics, at least in that example, are pretty specific to that user. So as you embarked on this project at LinkedIn, were there new ways of thinking about analytics that you had to master in order to build Pinot?

Kishore Gopalakrishna 8:19
Yeah, absolutely, I think you brought up a very important point, which is really about providing analytics in a personalized fashion. It's geared toward the one person who is actually looking at it — what's the view that you, as the user, will actually be excited about? So to a certain extent it's personalized analytics. But there are other use cases that can be aggregated, even at a different level — at an operator level on your customer's side, or at a partner level. Talent Insights is another example. Think of a company like Amazon or Google: you can now see, hey, where are my people going? Are they going to Facebook? Which area of Facebook, or of some other company? Which skill sets am I losing people in, and where am I gaining more people? This inflow and outflow of talent is an aggregate metric, but it's per customer. LinkedIn provides Talent Insights as a product to Amazon, Facebook, and others. If you go back before this, the world was basically report-based: every month, someone internally at LinkedIn would run a Hadoop job or a Spark job and send over a report saying, okay, here is the report you asked for, here's the number of people. But now it has become interactive — that report has turned into a data product. That's the mindset shift I generally see with user-facing analytics: we were generating all these reports and sending them to the user. Can we actually flip this report generation into a data product, so that we put the data and insights directly in the hands of the customers? Then they can make use of it the way they want to; they can slice and dice however they want.
And they are directly generating insights, instead of telling us, "this is the report I want, can you run this report?" and then coming back — that's the old way of solving this problem. To a large extent the problem is the same; it's the way you think about solving it that changed. That's where Pinot came in: LinkedIn changed the way it was trying to solve this, and Pinot was born as a result, because once we tried to solve it, we hit all sorts of challenges. Something I didn't mention on the previous topic you brought up, internal versus external, is these three dimensions. First, freshness: how fresh is the data that I'm seeing? Second, latency: can I ask questions at the speed of my thought? I don't want the user to ask a question, go for a coffee break, and come back. That's what happens in the internal world — you run a query, it takes a minute or two, and it's batchy. Internal analytics can live with that, but user-facing has to be real-time and very, very low latency; it should be interactive. The third dimension is concurrency. Internally, there are typically very few people accessing a dashboard, because it's always bounded by the number of employees in your company. When you go outside, it's limitless — it depends on how many users you have. LinkedIn has hundreds of millions of users and continues to grow. Uber Eats has hundreds of thousands of restaurants, and that keeps growing. So you're seeing orders-of-magnitude changes in latency, freshness, and concurrency.
That's the main reason we built Pinot: we just couldn't scale with the old system that we had.

Eric Dodds 12:08
Super interesting. Yeah, and that makes total sense when you think about internal users — maybe someone's looking at a report a couple of times a day, or once a day — versus 50 million people logging in and needing to be served a piece of data in, I don't know how long it takes, a couple of milliseconds or something. Okay, for my next question I'm actually going to hand it off to Kostas, because I could keep going here, but I want to know how you actually did that and learn more about Pinot, and Kostas is way better positioned to ask those questions. My guess is that's what he's going to ask about. So Kostas, please take it away.

Kostas Pardalis 12:47
Yeah. So you might have to repeat yourself a little bit, because the question I'm going to ask will probably overlap a bit with the stuff you discussed with Eric, but it's good to put some things in the right perspective. I was checking the website of Pinot, for example, and if you try to figure out what Pinot is, the first thing you'll see is that it's a distributed relational OLAP database system. And I'd like to ask you: as an OLAP system, how is it different from something like Snowflake, or Redshift, or BigQuery? Because those are also very typical examples of OLAP systems. So what makes Pinot different? And most importantly, why do we need to architect OLAP systems differently and create something like Pinot?

Kishore Gopalakrishna 13:44
It's a great question. This is one of those things that's very hard to distinguish from 10,000 feet — it's very hard to say more than, hey, these are all analytical systems; at the end of the day, you're asking SQL queries and getting answers. You can broaden it even more: OLTP systems also support analytics, right? Why can't I use Postgres? Why can't I use MySQL? Why can't I use any of these NoSQL document stores? At the end of the day, they all hold data and they all support SQL queries. Where the differences start coming up is the workload — the kind of workload and the kinds of use cases that they power. You definitely can't power a very high-throughput analytical use case on an OLTP system, because the way the data is organized internally is row-oriented, and it's very well known that you need to shift from a row-oriented structure to column-oriented, because an analytical query is not a key lookup like in an OLTP system — you need to scan a lot of data. Now you come to the analytics side. You have this variety of systems — as you mentioned, the data warehouses of the world: BigQuery, Snowflake, and Redshift. And then you have OLAP systems, which in their origin were legacy systems — people used to think of OLAP as cubes; OLAP and the OLAP cube were very synonymous. The concepts of data marts and the data warehouse existed before, so it's all very similar in that sense, but some things changed in the last decade. OLAP cubes were at some point very, very popular — you had offerings from SAP HANA, IBM Cognos, and others; there were a lot of these OLAP cube systems.
So it's important to know how things were solved before. What used to happen was that these cubes would be computed for every question you might want to ask. How many people from the US on Chrome on Windows actually looked at my ad? The answer would be pre-computed in an ETL job and pushed into the OLAP system, and then the answer would come back very, very fast. The reason it really broke down is two things. One, there was a huge explosion in storage — pre-computing all these answers was very, very expensive. And two, it was inflexible: you add just one more dimension, and everything is gone — all your previous pre-computation is wrong. So cubing technology never really took off, because it wasn't really made for this. Whereas the data warehouse was really good for internal analytics: the analyst could write a very complex join — a 100-table join and things like that — but the query would take a long time. You couldn't really take a data warehouse and put it in front of your end users; that's not what they're good for. They're generally about throughput, not latency — how much data can I scan to answer this question, rather than how quickly can I answer. So their roles were slightly different: the OLAP system versus the data warehouse. So those are really the three kinds of systems: the OLTP system, the data warehouse, and OLAP. And Pinot, Druid, and ClickHouse are the three systems that kind of revolutionized the way OLAP is solved. The rise of SSDs, especially, helped us not have to pre-compute all these things upfront.
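The storage explosion Kishore describes — every pre-computed answer multiplying with each added dimension — can be sketched with a toy calculation. The dimension names and cardinalities below are made up purely for illustration:

```python
from math import prod

# Hypothetical dimension cardinalities for an ad-analytics table,
# as in the "US + Chrome + Windows" example.
cardinalities = {"country": 200, "browser": 50, "os": 20}

def cube_cells(cards):
    """A full OLAP cube pre-aggregates every combination of dimension
    values (plus an 'ALL' rollup per dimension), so its size grows
    multiplicatively: prod(cardinality + 1) cells."""
    return prod(c + 1 for c in cards.values())

before = cube_cells(cardinalities)
cardinalities["device"] = 30          # add just one more dimension...
after = cube_cells(cardinalities)

print(before)          # 215271 pre-computed cells
print(after)           # 6673401 -- the cube grew ~31x
print(after / before)  # and every prior rollup must be rebuilt
```

Adding one 30-value dimension multiplies the cube by 31, which is exactly the inflexibility and storage cost that made pre-computed cubes break down.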
So we could just keep the raw data as it is — you don't have to aggregate it — and do the aggregation on the fly. Really, the trade-off OLAP always had was: do we do the processing beforehand, or on the fly at query time? With the rise of SSDs and other techniques like SIMD, columnar storage, and indexing, you could actually say, hey, I can compute all these things on the fly and still give you really, really good results. That's how this world has evolved. Now, you can think of OLAP as an acceleration layer, in some cases on top of the data warehouses. But because of the real-time nature, it's no longer only sitting on top of the data warehouses, which are a batch source — it's directly sitting on top of streaming sources like Kafka. That's where the freshness attribute comes in: I can't wait for someone to take the data from Kafka or other messaging systems, do an ETL, load it into the data warehouse, and then get it into OLAP. That path is getting shorter today: you point at your streaming source, and the moment the data event is created, we can ingest it and directly serve it. So there were a lot of variables that changed in the world that resulted in the creation of these systems. The nice thing is it empowered the product folks and everyone else in the company to think of things they hadn't thought about before. If you look at LinkedIn, it's serving 200,000 queries per second on Pinot today. It's huge — it's almost like taking an OLAP system and serving an OLTP kind of workload.
And if you look at internal analytics, it's hardly single-digit queries per second across the entire company. So it's a day-and-night difference between what you can accomplish with something like Pinot and something like the internal data warehouses. Sorry, that was a long answer, but these dimensions matter a lot when you deep dive into what the use cases are and what they empower and enable a company to achieve.

Kostas Pardalis 19:53
100%, and I think it's important to have these conversations and put things in the right perspective, because even for data engineers out there, it might be hard sometimes to understand what the differences are and why they might need not just a data warehouse but also a real-time OLAP system, and vice versa. People who have been working with these systems for a long time take all these things for granted, but they're not obvious. I remember when I first used Redshift: back then Redshift supported, I don't know, something like 100 concurrent queries maximum, and you had to start using queues and things like that to do your analytics. Which made total sense, because it was a system built, as you said, to go through a lot of data — big queries that take a long time to compute — not to serve however many concurrent users you're going to have. It's always trade-offs, and in the end the trade-offs give you a different system as an output. That's what I'm trying to emphasize here, because these trade-offs are very, very important for us engineers to understand.

Okay. So what's the secret sauce? How do you take a system like the data warehouses we were used to and turn it into something that is OLAP but also has the performance of, let's say, OLTP? How does Pinot do this?

Kishore Gopalakrishna 21:41
That's a great segue into the internals. Let me start from the high level and then deep dive into what happens. Let's take a query, for example — what are the phases of a query when someone runs it on a database? Typically, a query has a filter step: you have a predicate in the query, and you do the filtering part. The second is the scanning part: once I've filtered and narrowed things down, now I need to scan a bunch of data. And the third part is the aggregation, or the group-by — whatever operation you're doing on top of it. You can break down most queries into these three phases. There are joins and other things that come on top, but I'm going to keep it simple here. It's really filter, project, and aggregate/group-by — those are the three phases. Now, if you look at how data warehouses did this, the filter phase was always brute force: I'm going to scan as fast as I can, and I'm going to filter. Yes, they keep some additional metadata — min/max values, Bloom filters, and things like that — but those are all high-level, sparse indexes, as I'd call them: can I eliminate some chunks of data? What we found was that this wasn't scalable, because it's very hard to say predictably, "my filter is going to take X milliseconds." And for us, it was very, very important to have that predictable latency. If you're looking at the Who Viewed My Profile page, we have a 99th-percentile latency requirement of less than 100 milliseconds, and we would get alerted if something went beyond that.
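The filter / project / aggregate breakdown can be sketched over a toy columnar table. All of the data here is hypothetical, and real engines vectorize each step; this only shows the three phases in order:

```python
# Toy columnar table of ad impressions (hypothetical data):
# each list is one column, index i is row i.
country = ["US", "US", "KE", "US", "DE"]
browser = ["Chrome", "Safari", "Chrome", "Chrome", "Chrome"]
revenue = [1.0, 2.0, 0.5, 3.0, 1.5]

# Phase 1 -- filter: evaluate the predicate, producing matching row ids.
matches = [i for i in range(len(country))
           if country[i] == "US" and browser[i] == "Chrome"]

# Phase 2 -- project/scan: read only the needed column for those rows.
values = [revenue[i] for i in matches]

# Phase 3 -- aggregate: here a simple SUM over the projected values.
total = sum(values)

print(matches)  # [0, 3]
print(total)    # 4.0
```

The brute-force warehouse approach Kishore describes spends its effort making phase 1's scan fast; Pinot's indexing, discussed next, tries to skip that scan entirely.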
So we had to put in all these things to address that latency requirement, and that's where indexing comes into the picture. You go from scan-based filtering to index-based filtering, and that's where Pinot shines a lot. It does two things. One, it rearranges the data automatically so that you get memory locality. For personalized analytics, it can automatically rearrange the data internally to make sure all your profile views are stored together in the same place, so that in one seek I can get all your profile views, instead of them being scattered so that I'd have to do a lot of separate reads. That's the first optimization: we automatically reorganize the data as it comes in, which is very, very important for minimizing the number of seeks we need. The second is the indexing. We go from sparse, block-level or chunk-level indexing to row-level indexing — very fine-grained, so I can tell exactly which rows, and which segments in Pinot's case, your profile views are placed in. So it's really about that filtering phase: how efficiently can you prune away data so that you don't have to do the work? The way to look at it is: the data warehouses and all these other databases try to do the same work, but faster, by using SIMD and other techniques. The philosophy of Pinot is, hey, can we not do the work at all? That's where indexing comes in — it's about eliminating the work rather than doing the same work faster. That's the most important thing in the filtering. And Pinot has tons of indexes.
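The shift from scan-based to index-based filtering can be sketched with an inverted index over a toy column. The data is made up, and this is an illustration of the idea only, not Pinot's actual implementation:

```python
from collections import defaultdict

# Toy column of "viewer country" values (hypothetical data).
column = ["US", "KE", "US", "DE", "US", "KE"]

# Brute-force filtering touches every row at query time.
scan_matches = [i for i, v in enumerate(column) if v == "US"]

# An inverted index maps each value to the row ids containing it,
# built once as data arrives; filtering then becomes one lookup.
inverted = defaultdict(list)
for row_id, value in enumerate(column):
    inverted[value].append(row_id)

index_matches = inverted["US"]

print(scan_matches)   # [0, 2, 4]
print(index_matches)  # [0, 2, 4] -- same rows, no per-row work at query time
```

This is the "eliminate the work" philosophy: the index answers the predicate directly, so the query never scans the non-matching rows at all.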
There's an index for pretty much every use case you can think of. There are range indexes: you can say, "tell me all the queries that took more than three seconds." A classic inverted index won't work for that, but a range index quickly knows, hey, these are the rows with more than three seconds of latency. There's the inverted index, which is classic and probably common to all the other systems. We have a text index, so you can index text; a JSON index; a geo index. There are tons of indexes like this, and that's how we architected it: the indexing scheme itself is pluggable, so you can keep adding more indexes. We've added a lot of indexes over time, and we continue to add more. So that's the innovation we did on both the storage layer and the indexing layer. The next part is the aggregation: even after the filtering, how can we make sure the aggregation is faster? There again, there were classic techniques for making the scan very fast. I introduced another technique called the star-tree index, which is also how we named our company, because it makes things faster by multiple orders of magnitude. To give you an example, take the classic case: how many ads did we show in the US? In a typical company, maybe 50% of your ads are coming from the US compared to the rest of the world. So even with an indexing technique, you'll end up scanning 50% of your data to answer "what is the revenue for the US?" That's where the star-tree comes into the picture. The star-tree analyzes the data upfront.
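The range-index idea — answering "more than three seconds" without an inverted index — can be sketched with sorted (value, row) pairs and a binary search. This is toy data and a simplification, not Pinot's actual on-disk structure:

```python
import bisect

# Toy latency column in seconds (hypothetical data).
latency = [0.4, 5.2, 1.1, 3.7, 0.9, 8.0]

# A range index keeps (value, row_id) pairs sorted by value, so a
# predicate like "latency > 3" becomes a binary search plus a slice
# instead of a full scan -- something a value-to-rows inverted index
# can't do for range predicates.
pairs = sorted((v, i) for i, v in enumerate(latency))
values = [v for v, _ in pairs]

cut = bisect.bisect_right(values, 3.0)        # position of first value > 3.0
slow_rows = sorted(i for _, i in pairs[cut:])

print(slow_rows)  # [1, 3, 5]
```

The lookup cost is logarithmic in the column size plus the number of matches, regardless of where the threshold falls.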
And then it automatically figures out: hey, if a query comes for the US, it's going to be very expensive, so I'm going to pre-compute only for the US. But for something like Kenya, you don't really have to pre-compute — it can do the computation on the fly. So it will compute on the fly for Kenya. It has this smartness built in to figure out what needs to be pre-aggregated and what can be aggregated on the fly. You don't have to explicitly say "aggregate for the US, don't aggregate for Kenya" — you provide the data as it comes in, and it creates these smart indexes on top of it. So it's this enhancement and innovation at every layer, from storage to indexing to aggregation. There's also a lot of pruning that happens at the broker layer: how can we minimize, how can we eliminate the work? It's across all these layers of the stack that we've done the optimizations that give us the speed. There's actually a nice blog on what makes Pinot fast, where we've listed all these different techniques.
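The star-tree trade-off — pre-aggregate only the values that would otherwise force a large scan, and compute the rest on the fly — can be sketched like this. The threshold and data are invented, and Pinot's real star-tree is a tree over multiple dimensions; this only illustrates the selective pre-aggregation idea:

```python
from collections import Counter

# Toy ad-revenue events keyed by country (hypothetical data):
# the US dominates, Kenya is tiny -- as in Kishore's example.
events = [("US", 1.0)] * 500 + [("DE", 2.0)] * 80 + [("KE", 0.5)] * 5

counts = Counter(c for c, _ in events)
THRESHOLD = 100  # pre-aggregate only values that would force a big scan

# Build stored aggregates for the "expensive" values (here, only US).
preagg = {c: sum(r for cc, r in events if cc == c)
          for c in counts if counts[c] >= THRESHOLD}

def revenue(country):
    """Answer SUM(revenue) WHERE country = ?, star-tree style:
    a stored aggregate for heavy values, an on-the-fly scan otherwise."""
    if country in preagg:
        return preagg[country]                        # pre-computed
    return sum(r for c, r in events if c == country)  # cheap on-the-fly scan

print(revenue("US"))  # 500.0 -- served from the pre-aggregate
print(revenue("KE"))  # 2.5   -- computed on the fly over only 5 rows
```

The storage cost stays bounded (one stored aggregate instead of one per country), while the worst-case query never has to scan the dominant 50%-of-data value.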

Kostas Pardalis 28:28
Okay, that's awesome. A follow-up question to that, because it sounds almost too good to be true, in a way: why do we need Snowflake if we can have something that's as fast as OLTP and at the same time is an OLAP system? So what are the trade-offs? Because when we do engineering, we always make trade-offs, right? What do we lose by adopting a solution like Pinot compared to a traditional OLAP system like Snowflake?

Kishore Gopalakrishna 29:00
I think there are two things. One is the flexibility of queries: Pinot is definitely not built as a data warehouse. Pinot is not going to support, say, a 100-way join. Right now we don't have full joins — we have lookup joins, and we are adding full joins as well. But the key thing for us is to make even joins fast: we're going to address the cases where joins can also be predictable and very, very fast, and we have some innovative things we're adding at the join layer to help solve those join use cases — again, keeping in mind very, very low latency. Data warehouses are really made for the analyst who can write a 1,000-line SQL query over a lot of different tables that need to be joined; Pinot can't really parse that, but that's the purpose of a data warehouse. Pinot is definitely not built for that, and we don't intend to go there; we want to make sure we solve one use case and solve it well. The second one is the cost aspect. You need a certain level of usage for this, because — this is changing, but — it's a tightly coupled system. That means you pay for storage and compute together, because the compute is always running. Unlike Snowflake, where you can bring up the compute when you want, when you're running the query, and pay only for that access — which is good; it serves a particular purpose. But you can't do that for user-facing analytics: you can't say, a user came to my profile page, now I'm going to spin up the compute to answer them. So there is a break-even point.
So typically, once you get to tens and hundreds of queries per second, you end up actually needing something like Pinot, because the other side becomes a lot more expensive — you’re paying per query, right? So if you look at the total cost of the two systems, the cost per query in Pinot will be orders of magnitude lower than in Snowflake. But only after a particular scale, when you have that level of concurrency — you’re serving your apps, you’re serving your users. That’s the point at which Snowflake and other systems become super expensive. And the key thing to realize is that it’s kind of a chicken-and-egg problem. Sometimes you try to build your user-facing apps on these data warehouses, and the latency is bad, you won’t have the freshness, you won’t have the dimensionality that you need for your app, and then your app never takes off. The key thing is to make sure that the users get what they want, and then you will see the concurrency, the number of requests, actually go up a lot. And that’s kind of what we see — as I said, LinkedIn is serving hundreds of thousands of queries per second on top of it. That’s because you provide an app to the end user, right? You’re not writing SQL queries against this — a restaurant owner is not writing SQL queries. It’s very different in terms of the way people interact with something like Pinot versus a data warehouse system. So you always want to think in terms of apps: what apps do I build on top of this, how do I actually showcase the value to the end users? That should always be the goal — not providing them with a SQL editor and saying, okay, go write SQL queries. That will never work, and it’s also probably the wrong way of using something as powerful as Pinot.
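The break-even point Kishore describes can be sketched with a toy cost model. All prices and rates below are hypothetical, purely for illustration of the crossover between a pay-per-query warehouse and an always-on serving cluster:

```python
# Toy model of the cost-per-query break-even point: a pay-per-query
# warehouse vs. an always-on cluster. All numbers are hypothetical.

def monthly_cost_per_query_warehouse(cost_per_query: float) -> float:
    # Decoupled, pay-per-use: each query has a roughly flat cost.
    return cost_per_query

def monthly_cost_per_query_cluster(cluster_cost_per_month: float,
                                   queries_per_second: float) -> float:
    # Tightly coupled, always running: fixed monthly cost amortized
    # over however many queries you actually serve.
    queries_per_month = queries_per_second * 60 * 60 * 24 * 30
    return cluster_cost_per_month / queries_per_month

# Hypothetical figures: $0.01 per warehouse query, $5,000/month cluster.
warehouse = monthly_cost_per_query_warehouse(0.01)
low_qps = monthly_cost_per_query_cluster(5000, 0.1)   # ~0.1 QPS: internal BI
high_qps = monthly_cost_per_query_cluster(5000, 100)  # 100 QPS: user-facing app
```

At low concurrency the always-on cluster is more expensive per query; at high, user-facing concurrency it becomes orders of magnitude cheaper, which is the crossover he is pointing at.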

Kostas Pardalis 32:37
Okay. Yeah, makes total sense. So you mentioned before two other vendors — outside of StarTree and Pinot there’s also ClickHouse and Druid, right? And I think Druid is probably the oldest one, or am I wrong on that?

Kishore Gopalakrishna 32:51
I think, yeah, you’re right — Druid is probably the oldest one, I would say. I don’t know, Elasticsearch was probably there before; it’s also something that people use for analytics, although it was not purpose-built for that — it was really built as a search engine. But once you have an inverted index, you can throw anything at it, so a lot of people ended up using Elasticsearch as well. Yeah.

Kostas Pardalis 33:13
So what’s the difference between these three, like, different tools? Give us a little bit of — help us understand the landscape, understand the vendors there, and what the differences are.

Kishore Gopalakrishna 33:26
Yeah, I think the key thing is to look at the evolution of analytics itself, and then it kind of becomes apparent where one system shines versus the other. So the world went from batch analytics to real-time analytics — that’s where Druid came in, and I think ClickHouse added the Kafka connectors and other things later on. It was really going from batch to real-time where those systems came in, and Pinot’s contribution is going from real-time with low concurrency to external-facing analytics with high concurrency and predictable low latency. So freshness was the first factor, which drove Druid and ClickHouse in the beginning, and back then latencies in seconds were completely fine. We still need the real-time freshness — that has become table stakes right now — but now you are adding other dimensions: it has to be millisecond response time, it has to serve very, very high concurrent requests, and it has to be predictable about that. So it’s really when you draw the graph of latency versus concurrency that you start seeing the differences between these systems. Pinot is able to keep that low latency as the workload increases, whereas the other two systems were not really made for that in terms of the low-level system design, as I mentioned about the fine-grained indexing. ClickHouse still has sparse indexes; it doesn’t have the fine-grained indexes that Pinot has. Druid has only one level of indexing, which is the inverted index — it doesn’t have all the other indexes that we mentioned, it doesn’t have the concept of the star-tree index, and there are others. So we are very, very focused on providing predictable low latency at any scale, whereas with Druid and ClickHouse, the main purpose was to provide low latency for the internal use cases. That’s where they really started, and that was the premise behind the design.

Kostas Pardalis 35:30
Okay, yeah, that makes sense. And that’s actually interesting, the differentiation there between the internal and external use cases, because, yeah, concurrency is important there — you need really high concurrency together with the low latency to do that.

Okay, so let’s go back a little bit to the technical conversation that we had about indexes and all that stuff. What’s the experience a user has setting up Pinot right now? Okay, you give all these options in terms of the indexes that you can use, and you also have a smart indexing system with the star-tree index that you mentioned. How much is the user responsible for making Pinot as fast as it can be? And how much of this is automated?

Kishore Gopalakrishna 36:26
We definitely don’t want to try to become too smart in terms of figuring out the indexes automatically. So our first option was to always give the user the knobs — I mean, this is what databases have done all over the world for decades, right? You have an ALTER TABLE and you can add indexes. The approach we took is: these are all the features that we have, here are all the knobs, you have the control in terms of what you are trading off — because adding an index comes with a cost, you’re adding a little bit of storage overhead, but now you’re trading that off for amazing performance. So we have this complete flexibility. That’s where we went: each column can be encoded in any way — it can be raw-encoded, it can be dictionary-encoded, it can be run-length-encoded; you have all these options that we give to the user. And each column can have any type of index: it can have an inverted index, it can have a range index, it can have a JSON index, whatever you want. And even the type is very flexible — it can be string, it can be long, etc. It can also be semi-structured JSON, so now you have indexes on the JSON as well. Or it can even go further and be a text index, which is completely unstructured. So you can go from structured to semi-structured to unstructured — we give that whole flexibility to the user. Now, again, that comes with an onus on the user: oh my god, what indexes do I configure? So the way we approach that problem is really about giving them the insights when they run a query. The way we advise is: don’t try to configure anything up front, because everything can be changed dynamically while the system is running — this is how we built and operated the system at LinkedIn.
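As a concrete illustration of the per-column knobs Kishore lists, here is a sketch of an Apache Pinot table config fragment. The field names follow Pinot’s `tableIndexConfig`, but the table and column names are hypothetical, and the exact shape should be checked against the Pinot docs for your version:

```python
# Sketch of per-column encoding and index choices in a Pinot table config.
# Structure follows Apache Pinot's tableIndexConfig; treat it as
# illustrative, not authoritative. Table/column names are made up.
table_config = {
    "tableName": "profileViews",                 # hypothetical table
    "tableType": "REALTIME",
    "tableIndexConfig": {
        # Encoding: raw (no dictionary) vs. dictionary-encoded columns.
        "noDictionaryColumns": ["viewCount"],
        # Per-column indexes, each trading storage for query speed.
        "invertedIndexColumns": ["viewerCompany"],
        "rangeIndexColumns": ["viewTimestamp"],
        "jsonIndexColumns": ["viewerMetadata"],   # semi-structured JSON
        # Star-tree index: pre-aggregation over a dimension combination.
        "starTreeIndexConfigs": [{
            "dimensionsSplitOrder": ["viewerCompany", "viewerTitle"],
            "functionColumnPairs": ["COUNT__*"],
        }],
    },
}
```

The point of the example is that each column picks its own encoding and indexes independently, which is the flexibility he describes.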
So all of these indexes that we talked about can be added dynamically without having to re-ingest the data. You can just ingest the data as it is — don’t worry about the performance in the beginning, run the query. And unlike traditional databases, which provide an explain query plan that says, okay, these are the things that are slow, we actually embed that in every query. So when you run a query, you get the response, and you will know exactly how much time was spent in which phase of the query: is it the aggregation phase, is it the filter phase, is it the projection phase? Based on that, you can go back and say, okay, I’m spending a lot more time in filtering — on what column, why am I spending time on that, and what index can I add? So it’s more reactive than trying to be smart about it. But these are some of the things that we plan to make easier with the StarTree version of Pinot, where we will constantly keep analyzing the queries, and then behind the scenes we will say, okay, this is what needs to happen, and we automatically set up those indexes so that users automatically start seeing better performance. So the learning and application of those indexes is kind of outside of Pinot, because we don’t want to do too many things automatically — it just confuses the user.
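The per-query stats Kishore describes come back as metadata in the broker response. A sketch of reading them — field names like `timeUsedMs` and `numEntriesScannedInFilter` follow Pinot’s broker response format, though the example numbers are made up and fields can vary by version:

```python
# Sketch: inspect the execution metadata Pinot embeds in every query
# response to decide which index to add next. Example values are made up.
import json

def summarize_response(raw: str) -> str:
    resp = json.loads(raw)
    return (f"total {resp['timeUsedMs']} ms, "
            f"scanned {resp['numEntriesScannedInFilter']} entries in filter; "
            "a high filter-scan count suggests indexing the filtered column")

# Hypothetical broker response for a slow filter-heavy query.
example = json.dumps({
    "resultTable": {"rows": []},
    "timeUsedMs": 87,
    "numDocsScanned": 1200,
    "numEntriesScannedInFilter": 4500000,  # filter is doing a near-full scan
})
print(summarize_response(example))
```

This is the reactive loop he outlines: run the query, read the embedded stats, then add the index (dynamically, without re-ingesting).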

Kostas Pardalis 39:43
Yeah, of course. Okay, it’s time for me to give the microphone back to Eric. One last question: share with us something exciting about StarTree that is going to come in the next, I don’t know, couple of weeks, a month or whatever — what are the next things that you are very excited about?

Kishore Gopalakrishna 40:05
Yeah, I think two things. One is definitely tiered storage, which is a big design change — it is not easy to change the design of systems that are five to six years old, right? If you look at all these systems, they are tightly coupled: in Druid and ClickHouse, the storage and compute are tightly coupled. And now we have actually announced a decoupled version. So now, with one system, you can say these tables should be local and these tables should be remote. Because one of the things that users were asking is: okay, Pinot is so fast, how can I keep the data much longer? I don’t care too much about the latency for the old data, but I want very, very fast latency for the recent data — and I don’t want to take this data from one system to another system and keep moving things around; can you actually make it work in the same system? That’s where the tiered storage comes in. You can say, for seven days I want the data to be local, and as soon as the data is older than seven days, it gets decoupled, so it goes into S3 and the queries run directly on top of it. That’s something that we are super excited about. So now you can keep the data in Pinot for however long you want, and when a query comes for the older data, it’s going to be slightly slower — instead of tens of milliseconds it’s going to be hundreds of milliseconds — but that’s acceptable. You’re now able to trade off between cost and latency; that ability was not available before, and now we give that other level of flexibility to the user in terms of picking one versus the other. And the second one is joins. We are in very early stages, but we are coming up with some really cool ways of doing joins, unlike the other systems, in terms of being able to do joins in a streaming manner.
That’s something that’s going to help us in terms of addressing some of the user-facing analytics use cases where you just need a lookup join between two tables, or you want a single-stage join between two tables. So that’s something that we are also super excited about.
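The seven-day policy Kishore describes maps to Pinot’s `tierConfigs`. A sketch, where the field names follow the open-source tiered-storage config but the server tag and the S3-backed cold tier are assumptions (StarTree’s hosted offering configures the remote tier its own way):

```python
# Sketch of a "local for 7 days, then cold tier" policy via Pinot's
# tierConfigs. Segments OLDER than segmentAge relocate to the tier.
# The serverTag value is hypothetical.
tier_configs = [
    {
        "name": "coldTier",
        "segmentSelectorType": "time",
        "segmentAge": "7d",           # segments older than 7 days move here
        "storageType": "pinot_server",
        "serverTag": "cold_OFFLINE",  # hypothetical tag for cold-tier nodes
    }
]

def tier_for_segment(age_days: float) -> str:
    """Toy selector: which tier serves a segment of the given age?"""
    return "coldTier" if age_days > 7 else "default (local)"
```

This captures the trade-off he describes: recent segments stay on local disks for tens-of-milliseconds latency, while older segments accept hundreds of milliseconds in exchange for cheaper storage.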

Kostas Pardalis 42:18
Super cool. All right, Eric. All yours.

Eric Dodds 42:22
All right, one more question, since we’re close to the buzzer here. I’m just thinking about our listeners out there who are listening to you talk about user-facing analytics and speed and all these other things, but are wondering: when should I adopt a technology like Pinot and StarTree? What are the signals in your mind that indicate, okay, you should look at this? Are they related to scale? Are they related to complexity? Where in the lifecycle, say, of the growth of an organization? What are the indicators that this technology is appropriate to implement?

Kishore Gopalakrishna 43:11
Yeah, no, I think that’s a good question. I think two years back my answer would have been very different — I would have said, until you grow like LinkedIn or Uber, you don’t need to. But that’s changed drastically in the last two years. I think there is a lot of value that is hidden in the data that companies are collecting, but that’s not unlocked for their end users, right? Look at all the small startup companies that are out there collecting huge amounts of data from their customers, especially the SaaS vendors. Now, what are they doing with the data? They cannot just sit on it — give me all your data, and I’m not giving anything back to you? So now they’re forced to think: okay, what more value can I actually provide on the data that I’m collecting? So they’re building some sort of a data product on it. I would really start from there. It’s not about the size of the data, it’s not about the complexity of the queries, or any of those. It’s really about what more value we can extract from the data we are collecting, what better decisions we can enable. Like recently — even take the case of Cisco, which is kind of big — but even on the call that is happening right now: how long did you talk, how long did I talk? That’s useful analytics to have readily available across all the podcasts that you have done. It would be so cool to get analytics on top of that — what was the tone, how long did each one talk? All sorts of analytics are actually very useful. So it’s really about: hey, we have all this data, and for decades we have been thinking about providing insights to our internal employees within the org. How can we maybe do something better than that?
Like, how can we give it to the people who can actually make use of this data and make better decisions in their lives? That’s the mindset I would start with. And once you start seeing the patterns, it’s not that hard to find that there’s so much more value we can extract out of this data. That’s where I would start.

Eric Dodds 45:25
That’s great. Yeah. I love that answer, because I think it’s a good example of how technology that’s built for super-scale enterprise, if you want to call it that, trickles down and becomes democratized in a way for lots of companies to use. So, super exciting. This has been such a wonderful conversation, for sure. Thank you for joining us, giving us some of your time, and teaching us about Apache Pinot and StarTree.

Kishore Gopalakrishna 45:52
Thank you, Eric. It was really great to be on this call, and I completely enjoyed it. Thank you for having me, again.

Eric Dodds 46:00
I think the thing that I really took away from the challenges related to user-facing analytics was just the concept of, let’s say, 50 million people doing something at the same time and needing to be delivered some sort of result from a computation — how many people viewed your profile, etc. It’s just wild to think about that level of scale with that little latency; it’s really just insane. It sounds like a really challenging problem they solved, and Pinot seems like a pretty amazing technology to do it. But yeah, that’s not the type of problem that you face every day.

Kostas Pardalis 46:50
Yeah, I really enjoyed the conversation, because it’s one of these cases where a use case drives some extreme optimization of a system, and that almost leads to a new design — to new design principles for how a database system is built. So it was very interesting to discuss and hear all that stuff about how you index the data, how you store the data to make access to it faster, how you encode the data — all these optimization techniques that have to be applied in order to enable something like what we just described. And most importantly, what the trade-offs are, right? Because obviously there are trade-offs there. So yeah, it was a very, very interesting conversation, and I hope we’ll have the opportunity to repeat it in the future, because there’s a lot that we haven’t covered today.

Eric Dodds 47:49
I totally agree. Well, thanks for joining us again, and we will catch you on the next Data Stack Show.

We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.