Episode 181:

OLAP Engines and the Next Generation of Business Intelligence with Mike Driscoll of Rill Data

March 13, 2024

This week on The Data Stack Show, Eric and Kostas chat with Mike Driscoll, the CEO of Rill Data. During the episode, Mike recounts his journey from the Human Genome Project to developing the Druid engine, which was created to handle massive advertising data. He discusses Druid’s adoption by major companies and its evolution, emphasizing the importance of speed, simplicity, and scalability in data tools. The dialogue covers the progression of BI tools, the role of object stores, and the integration of AI in data technology. Mike also touches on the significance of SQL and AI’s influence on data visualization, what he would do if he wasn’t working in data, and more.

Notes:

Highlights from this week’s conversation include:

Michael’s background and journey in data (0:33)
The origin story of Druid (2:39)
Experiences and growth in Data (8:08)
Druid’s evolution (21:46)
Druid’s architectural decisions (26:32)
The user experience (30:06)
The developer experience (35:14)
The evolution of BI tools (40:55)
Data architecture and integration (47:53)
AI’s impact on BI (52:26)
What would Mike be doing if he didn’t work in data? (56:27)
Final thoughts and takeaways (57:02)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 00:03
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com. We are here with Michael Driscoll from Rill Data. Michael, thank you so much for joining us on the show today.

Mike Driscoll 00:31
Great to be here, Eric.

Eric Dodds 00:32
All right, well, give us your brief background. How did you originally get into data? And what are you doing at Rill today?

Mike Driscoll 00:39
Yeah, thanks. My background is probably not that dissimilar from a few of the guests you’ve had over the years. I started my career as a software developer working for the Human Genome Project a couple of decades back. And naturally, there’s a lot of data in the Human Genome Project. That was really the beginning of a multi-decade love affair with heterogeneous data at scale. Since then, I’ve started a few companies. My first startup was an e-tailer called CustomInk.com. We sell T-shirts on the internet. I later started a consultancy called Dataspora, where we did a lot of consulting work for banks and other folks in the big data era. I then went on to start a company called Metamarkets, which did analytics for advertising and was acquired by Snap, the makers of Snapchat. And now I’ve got Rill Data. We’re a few years into that journey and focused on an operational business intelligence product with Rill.

Kostas Pardalis 01:54
All right, that’s quite a journey, Michael. And I know that, as part of this journey, you also built some very interesting technologies, like Druid. From the conversation we had earlier, I learned a few things I wasn’t aware of about Druid, its relationship with BI, and the initial ideas behind it. I’m super excited to get into that and learn more about how you started building Druid, why you did that, and how you ended up today with Rill Data, which has Druid on the back end but is more than a query engine. So I’m super excited to get into the details. What about you? What are you excited to talk about today?

Mike Driscoll 02:47
I think there are a few big macro trends that we’re seeing in the data world today. I would be delighted to talk about some of the emerging data engines out there for powering fast analytics at scale, really at any scale: Druid and ClickHouse, and also DuckDB, which we all know is an exciting new engine. But the other trend that, for me, is particularly exciting is the trend towards serverless frameworks. For those of us who pay close attention to the space, and I know all of you do, there are a lot of new frameworks out there for not just taking data technologies to the cloud, but making them serverless in the cloud. Almost any area of the data stack, I think, is being remade to be truly serverless at scale in the cloud. That’s a pretty exciting area, and it’s going to take several years to play out.

Kostas Pardalis 04:08
Yeah, 100%. We’ll have a lot to talk about on the show. Eric, what do you think? Should we dive in?

Eric Dodds 04:15
Well, let’s do it. All right. I want to talk about Druid. I don’t think we’ve covered it extensively on the show. Maybe you can help us understand Druid by telling us the origin story, where it came from, and your time at Metamarkets.

Mike Driscoll 04:39
Yeah. The story of Druid really is quite similar to the story of a lot of technological innovation: necessity is the mother of that innovation. Metamarkets was started in the early 2010s as an advertising analytics business. I was the CTO and co-founder there. We were building basically an exploratory BI tool for some of the largest digital advertising platforms that were emerging back then. And as you can imagine, the data we were looking at consisted of billions and billions of advertising events. I’ve often said that, in general, advertising is the crucible of a lot of technology innovation. It’s one of the first industries that was fully digitally transformed, right? Digital media was already made of bits. And so, unlike e-commerce or other verticals, digital media advertising adopted, and invented, a lot of data infrastructure technologies much earlier than other verticals. So here we were, dealing with billions and billions of records. We had an early customer here in California called OpenX, which is still running today, one of the first programmatic advertising businesses to do real-time buying and selling of ads. And we tried a lot of different databases. I first started with Greenplum, a distributed Postgres engine I’d worked with, and tried to build interactive dashboards on top of that. We struggled at high concurrency. We eventually moved to a technique that people still use quite a bit: we put everything into HBase, a key-value store, precomputed all of the different projections of the OLAP cube, and stored those keys and values in HBase. But that quickly becomes untenable as you expand the dimensionality of your data. It gets kind of massive. And so then came an engineer that we had hired out of LinkedIn, very talented.
At the time, a young guy named Eric Tschetter showed up and said, “I have an idea for a distributed in-memory OLAP engine.” We were young and possibly naive, and we thought, all right, let’s give it a shot. So Druid: I think we started work on it in, actually, maybe early 2011. Eric wrote a spec for it. I have it on my blog, actually, where he wrote out the architecture and a few hundred lines of code, a 550-word requirements doc. And then about eight weeks later, the first version of Druid was in production. That was April 2011. We open sourced it at an O’Reilly Strata conference in October 2012. And so, yeah, the rest is history. Since then, it’s been widely adopted by lots and lots of companies, probably most notably Netflix, Lyft, eBay, Salesforce, Pinterest, and Yahoo. And of course Metamarkets used it widely, and we were acquired by Snap. I know Snap still today runs a pretty substantial Druid cluster.

Eric Dodds 08:59
Wow, what an incredible story. Because if you’re a company that’s providing a BI product and you told someone, “Well, we’re going to build our own real-time analytics database,” they would probably say that’s a really bad idea, building a database as an internal tool. But what an incredible story, with such wide adoption of Druid. Did you ever imagine that it would be adopted that widely when you started?

Mike Driscoll 09:32
I don’t think we expected it to get adopted so widely. In some ways, I believe some of the architectural advantages of Druid come from the fact that it was very purpose-built. We weren’t trying to create a general-purpose database at the time. We were trying to solve our own problem. That level of focus turns out to be an advantage, because we were able to sidestep a lot of requirements that we would have had to incorporate if we were trying to build a general-purpose tool. For instance, it didn’t support joins initially, and I would say even today it’s not known for great join support. But what happens when you solve your own focused, well-defined problem is that it turns out other people out there have similar problems. On the decision to open source it, I give some credit to Vinod Khosla, who was one of our early investors at Metamarkets. He supported that decision. Part of the reason we open sourced it, and that it gained adoption, is that it was not the focus of the business. We weren’t trying to monetize Druid. We were trying to be part of a broader ethos in Silicon Valley, which is to create more value than you capture. We were huge beneficiaries of lots of other Apache ecosystem tools, and it just felt like the right thing to do was to give this back. And yeah, I think it was fairly surprising. A lot of the credit also goes to the engineers. Engineers love working on open source tools, and there was a lot of investment by that early team to evangelize Druid, to give talks about it, to go help others who were trying to get it running at scale. So the adoption may have been surprising, but a lot of effort also went into driving that early adoption in the Valley.

Eric Dodds 12:00
Sure. And it sounds like you tried a lot of other tools before you ended up building it, which is indicative of there being a big need for it. You mentioned that it was purpose-built for pretty specific use cases, and you mentioned not-great join support as a characteristic of that. I’d love to know what the other characteristics of Druid are and what it really shines at. Maybe you can help us understand that by starting with the very specific problem you were trying to solve at Metamarkets. What were the reports you were trying to build that none of these other products could support?

Mike Driscoll 12:49
Sure. Right. So, fundamentally, and maybe as an aside: while it seemed crazy to build our own in-house data engine to power our BI tool, I do think if you look at the BI landscape, we’re certainly not alone in that decision. Power BI is powered by VertiPaq, which is quite a powerful OLAP engine. Tableau has Hyper, which is its internal engine. And if you look at Qlik, which is an inspiration for the BI tool that we built at Metamarkets and are continuing with at Rill Data, Qlik also has an internal engine. So if you look at these BI tools and the problems they’re generally trying to solve: VertiPaq and Hyper and Qlik, and even Sisense has an engine, but none of those engines were open source. So we weren’t able to adopt those at Metamarkets when we were building our analytics visualization tool. I would say there are a few primitives that are really important to support in the kind of ad hoc, exploratory business intelligence tool that we built. First and foremost, the most important is filtering: the ability to look at a data set and, in the case of billions of digital media impressions, filter on all of the impressions that are coming from cnn.com. That’s a really critical thing. People do it all the time in their BI tools, and it’s often thought of as a drill-down. There are a number of techniques that Druid uses, but fundamentally one of the core data structures under the covers is basically just inverted indices. You go through and essentially index all of your columns. Say you have a column named publisher website, and cnn.com gets tokenized into a number. Then you store an index that has all of the places where that particular value exists.
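The inverted-index idea described here can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Druid’s implementation: the rows, the column names, and the helper functions `build_bitmap_index` and `matching_rows` are all hypothetical, and plain Python integers stand in for the compressed bitmaps a production engine would typically use.

```python
from collections import defaultdict

# Hypothetical ad-impression rows: (publisher_website, country, impressions).
rows = [
    ("cnn.com", "US", 3),
    ("bbc.co.uk", "GB", 5),
    ("cnn.com", "GB", 2),
    ("nytimes.com", "US", 7),
    ("cnn.com", "US", 1),
]


def build_bitmap_index(values):
    """Map each distinct value to an int used as a bitmap (bit i set => row i has the value)."""
    index = defaultdict(int)
    for row_id, value in enumerate(values):
        index[value] |= 1 << row_id
    return dict(index)


def matching_rows(bitmap):
    """Yield the row ids whose bits are set in the bitmap."""
    row_id = 0
    while bitmap:
        if bitmap & 1:
            yield row_id
        bitmap >>= 1
        row_id += 1


# One index per dimension column.
publisher_index = build_bitmap_index(r[0] for r in rows)
country_index = build_bitmap_index(r[1] for r in rows)

# Filter "publisher = cnn.com AND country = US" as a bitwise AND of two bitmaps.
selected = publisher_index["cnn.com"] & country_index["US"]

# Aggregate only over the rows that matched the filter.
total_impressions = sum(rows[i][2] for i in matching_rows(selected))
print(total_impressions)
```

The point of the design is that a filter never scans the raw rows: each dimension value already knows which rows contain it, so combining filters is cheap bit arithmetic and the aggregation touches only matching rows.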
And you can do a very fast lookup on that data, and then aggregate only over the values that match. Those are primarily bitmap indices. Druid makes heavy use of these bitmap indices to index the high-cardinality dimensional columns in the data. And I think that’s the same technique that a lot of the other BI engines use as well.

Eric Dodds 16:14
Makes total sense. And tell us about Rill. So Metamarkets developed Druid and open sourced it, and you sold the company to Snap. Can you tell us a little bit about your time at Snap? And then I want to ask about Rill, because you’re sort of returning to Druid, in a way. So yeah, tell us about the time at Snap and how they leveraged the Metamarkets technology.

Mike Driscoll 16:37
I think for the team at Metamarkets, we always had aspirations of selling this very unique exploratory analytics tool to multiple verticals. Ultimately, what we found, which is no surprise given my comments about digital media often being the crucible of innovation for data infrastructure, is that the companies that had the most data were the ones that really needed the analytics stack that we built at Metamarkets. That stack consisted of real-time ETL pipelines that fed into an Apache Druid data layer, which then powered an interactive visualization tool. That three-layer stack turned out to be very valuable for digital media businesses, and our customers ended up being AOL, Twitter, which was actually one of our largest customers, and a number of leading platforms in the advertising space. What started as a commercial discussion with Snap in 2017 turned into an acquisition conversation, as can sometimes happen. Snap at that point was looking to accelerate their internal analytics roadmap, and they were definitely behind what Facebook, now Meta, and Google were offering to their advertisers. So Metamarkets turned out to be an extremely valuable technology asset for Snap to bring in house and use to build out their own internal and advertiser-facing BI platform. What we learned at Snap, which was interesting, is that this Druid-powered analytics stack had a lot of value beyond just advertising data. It soon became something that was used internally to look at lots of other data streams at scale at Snap, including telemetry data. Another thing Snap was going through at that point was attempting to roll out their Android app. And you can imagine the amount of telemetry. I’m not a mobile app developer, but I do know that developing Android apps is hard, because you have such a variety of different hardware platforms.
And so this Metamarkets stack became a critical part of, I would say, operational intelligence at Snap more broadly: everything from crash reporting and bug investigation for their application to, certainly, looking at their monetization, how many impressions and what sort of monetization results they were getting for their advertisers. It was also used widely not just by engineers but by the sales team and customer success folks. Just being at Snap and watching that wide adoption of this tool internally was the inspiration for thinking, hey, can we do more with this? So after a couple of years at Snap, I exited. I was really fortunate that I was able to license the core Metamarkets IP back out of Snap, and that became the genesis of Rill Data today. We really just saw the power of this platform, and really the generality of it, and that was the inspiration to start building Rill Data, now over three years ago.

Eric Dodds 21:00
Very cool. One thing I want to ask about Rill: there are certainly a lot of technologies out there that are available outside of Druid to do this sort of thing, which I want to ask you about. But the technology landscape has changed significantly since you created Druid. Can you give us a picture of how Druid has evolved over time? Because I think you said 2011, and you open sourced it in 2012. So we’re talking about the early days of the cloud data warehouse, which itself has changed significantly. I’d just love to hear the story, because Druid has obviously had a ton of staying power, and relative to the database world, it has been around for quite some time.

Mike Driscoll 21:55
Yeah. I think the market has certainly shifted. The technology landscape has shifted dramatically since Druid was created in 2011 and open sourced in 2012. So what are some of the major shifts? Probably, if I were starting Metamarkets today and we were looking for an engine to power interactive, exploratory data visualizations, we almost certainly would not need to create Druid. I think we’re all familiar with a number of pretty powerful engines out there that are quite similar to Druid. You’ve got Apache Pinot, which I think is fantastic, particularly for streaming use cases. You’ve got ClickHouse, which is great in terms of its simplicity and ease of getting running on a single node, and I think it now supports tremendous scale quite well in a distributed manner. Even a lot of the cloud data warehouses have gotten faster and better. I don’t know that I would want to run my BI stack or my BI applications directly on a data warehouse like Redshift, Snowflake, or BigQuery, but they’ve certainly gotten faster and are approaching some of the speed that Druid, ClickHouse, and Pinot offer. So I think it’s a very different world now. I still think there’s a need for fast engines when it comes to user-facing analytics applications, when it comes to data applications. What’s probably changed the most is that you can delay the decision to go to a distributed system longer than you used to be able to. I think the reason DuckDB has gained so much attention lately is that, look, in the early days of Hadoop, you couldn’t wrangle a billion records that easily on a single machine. And Moore’s law has had eight, maybe ten-plus cycles since the early days of when Hadoop was created.
Similarly, Spark was created in an era where machines were smaller and you needed to run things in a distributed way. So maybe one of the biggest changes is that now we can run much bigger data workloads on single machines. I think DuckDB’s popularity is a reflection of that: you may not need Spark, and you may not need Druid, ClickHouse, or Pinot to get the kind of fast, interactive speeds that you want for your data applications.

Eric Dodds 25:07
Fascinating. Kostas, I could keep going here, but we’ve entered the realm of talking about the current technology landscape and DuckDB, and I can see your hand reaching for the microphone. So yeah, go for it.

Kostas Pardalis 25:21
Thank you. Thank you very much. So Michael, I want to ask you something, because I think there’s a very unique opportunity here with Druid: we have a technology that has been out there for more than 10 years now. And as you said, and I think some of what we’ve discussed has already conveyed this, things were very different in 2012 compared to 2024. So I’d like to ask you: when Druid came out, what was, let’s say, the main competition? And I say competition not so much in terms of business competition, but more in terms of how people were solving these problems back then. And how is it today? When is a good time for someone today to go with something like Druid, considering all the changes you mentioned in the hardware, the software, and the market needs? Everything has changed in these 10 years, right?

Mike Driscoll 26:31
You know, what’s interesting is that some things don’t change nearly as much as we might think. Some things do change. But in terms of the key features of the engine that we developed, there were a few decisions made in that architecture that are powerful. And by the way, I think these architectural decisions still remain necessary at scale today. One of the first decisions was that we needed this to be a distributed database. We could not be limited to what fits on one node, so we needed to make it distributed and parallel. If you look at a lot of the tricks of the trade for making things faster across different data tools, it’s essentially: make them parallel. The second thing that we really focused on was aggregation of data. There was a post, I think one of the dbt Labs blog posts, that was an introduction to OLAP cubes. OLAP cubes aren’t going away. People still use them to aggregate data across dozens of different dimensions. Instead of storing raw event-level data, you store aggregates, which, depending on how you do the aggregation, can have a footprint between 10 and 100 times smaller than your event-level data. And then the third piece, I would say, is just indexing, and there are lots of ways to do indexing. Each of those pieces, parallelization via distribution, aggregation, and indexing, still remains relevant, I think. But I can say that for our customers today at Rill, and frankly even our customers back at Metamarkets, and I think this is true of a lot of data applications: the end users don’t care about the engine that’s powering the application. They just care about the user experience.
So for anyone who’s starting to build a data stack today, there are a lot of different tools out there, and Druid is one of them. But ultimately, you have to pick the right one. Depending on your scale, just pick the right engine that can deliver fast, sub-second performance for a data application, and you’ll make your end users happy.
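The aggregation step mentioned in this answer, storing aggregates instead of raw event-level data, can be illustrated with a small Python sketch. It is a minimal example under stated assumptions, not how Druid or Rill implement it: the event tuples, the 60-second bucket, and the `rollup` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical raw events: (timestamp_seconds, publisher, country, revenue).
events = [
    (0, "cnn.com", "US", 0.10),
    (12, "cnn.com", "US", 0.20),
    (47, "cnn.com", "GB", 0.05),
    (61, "cnn.com", "US", 0.30),
    (75, "bbc.co.uk", "GB", 0.15),
    (119, "bbc.co.uk", "GB", 0.25),
]


def rollup(events, bucket_seconds=60):
    """Collapse events sharing (time bucket, publisher, country) into one aggregate row."""
    totals = defaultdict(lambda: [0, 0.0])  # key -> [event_count, revenue_sum]
    for ts, publisher, country, revenue in events:
        # Truncate the timestamp to the start of its bucket.
        key = (ts // bucket_seconds * bucket_seconds, publisher, country)
        totals[key][0] += 1
        totals[key][1] += revenue
    return {key: (count, round(rev, 2)) for key, (count, rev) in totals.items()}


aggregated = rollup(events)
for key, value in sorted(aggregated.items()):
    print(key, value)
print(f"{len(events)} raw events -> {len(aggregated)} aggregate rows")
```

How much the footprint shrinks depends on the time granularity and on how many distinct dimension combinations the data contains, which is why the savings quoted in the conversation are given as a range.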

Kostas Pardalis 29:34
Yeah, okay. You gave me a very interesting cue here, because you said user experience, and I think we have to make a distinction. We have a special case with a system that is user-facing: you have someone who’s not necessarily an engineer who is going to do their analytics, maybe even a business user, when we’re talking about BI. So we have the users, who care about a specific set of things around the experience that they have. And then there’s also the developer experience, for whoever is responsible for deploying, operating, and building what the users need. I think we need both, and we need to balance both in the end. Can you tell us a little bit more about that? What is the user experience that you’re talking about, and what does it mean for a user? Can you define it in a few words, and then talk about how the developer experience differs from the user experience?

Mike Driscoll 30:43
Well, I think probably the most important value that we embrace in the design of the BI tool that we’ve got today at Rill Data, which again is quite similar to the BI tool that we built at Metamarkets, is simplicity. Too many data tools are built with the developer persona in mind, or the sophisticated analyst persona in mind. But when it comes to interacting with, exploring, and asking questions of data, I think one of your guests, from Alteryx, made this point: every knowledge worker is an analyst, and every knowledge worker needs to be a data worker. So at Rill, we really have focused on simplicity. Some of the UX pieces of that are direct interactions. If you want to know more about a value in the tool, click on it, and you can filter on it. If you want to zoom in on a time period, you should be able to drag over that sub-range and zoom in easily. So we really focused on simplicity, where you don’t need to get training to use it. People shouldn’t have to be trained on how to use dashboards. They’re such a part of the fabric of modern work. None of us are being trained in how to use a lot of the great tools that we use day to day, and I think dashboards should be no different, and data tools should be no different. The second value, talking about user experience, is speed: speed of interaction. Take that term, business intelligence. When we think about an intelligent person, we think about somebody who responds to a question within seconds of us asking it. Slow, I think, is often synonymous with unintelligent. So at Rill, we really have focused on making our exploratory data application sub-second. And the experience of sub-second tools just resonates with the human cognitive system. This is how we interact with the physical world: in a sub-second way.
And I think we’ve all gotten, unfortunately, too used to slow data applications. I think that’s a consequence of some unfortunate architectures that have been built. But at Rill, we really want speed to return to the forefront of working with data. And then the third value that we really embrace is scale. In our experience, what may start out as a small data set that you can keep simple and speedy often evolves into quite a large data set. Most of our customers tend to grow, and some of them are dealing with trillions of data points. Thinking about scalable systems does mean you have to make certain decisions. One decision we made at Rill for the user experience is that we require a lot of upfront modeling of data. We don’t let people play fast and loose with their data model. We don’t really embrace a lot of ad hoc or post hoc changes to data. We want organizations to invest time in building their data models, and the result is that we can support that third value of scale. Because if you’re going to scale up to billions or even trillions of events in your data, you do have to have a pretty well-thought-out data model to start with. So yeah, simplicity, speed, and scale are the three values that we think lead to a better user experience in the Rill product.

Kostas Pardalis 35:14
And what about the developer experience? What’s the difference there for a developer who has to go and manage, let’s say, Rill Data or Pinot or any other system as part of a broader data infrastructure? From your experience, what are the good and the bad things that are happening out there today?

Mike Driscoll 35:38
Well, I think there’s always this sort of yin and yang in the world of technology, things that swing from one side of a continuum to the other. One of those is server versus client. One thing that we’ve embraced at Rill, and that I think a lot of developers seem to like, is the ability to do development locally, versus developing remotely in the cloud. Those of us who do local development know why we like it as developers: it’s the speed of interaction, the speed of feedback. So I would say that’s one shift. Most of us have Apple silicon on our developer machines, an incredible amount of computational power underneath our keyboards, and it’s a tragedy not to be using that power in our day-to-day experience as developers. So that’s one piece. Another, where there’s been some debate, is low-code or no-code interfaces versus code-first interfaces. At Rill, we’ve made the decision to be very much a code-first developer tool. Everything we do, from defining data sources to designing data models to configuring the look and feel of our dashboards, is defined in SQL and YAML declarative artifacts. And I think that serves developers, if you’re thoughtful about the languages you choose. We really made sure we lean into SQL as our primary language for data modeling. A lot of other BI tools have proprietary data modeling languages, like DAX for Power BI, or Tableau with its own expression language, or LookML for Looker. But everyone knows SQL. So I think the code-first approach does serve developers. I think CLIs can be extremely powerful, and well-crafted CLIs spark joy for developers.
The last thing I would say, when it comes to whether you embrace a code-first path or a no-code or low-code path, is about the era of AI. I think there’s a quote from someone on Twitter that text is the universal interface. Code is such a powerful interface for the world. Here we are, communicating about lots of things just using, effectively, speech. In a world of AI, I think code-first interfaces will dominate, because that’s an API. For Rill, it’s not hard to have a copilot develop on Rill, because everything we do is code-first. It would be very hard to have a copilot interact with a set of UX components and design dashboards, data models, and data source credentials if everything were point and click. So yeah, those are probably the two things I think a lot about: the code versus no-code approach, and the local versus cloud development approach.
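To make the code-first idea concrete, here is a small Python sketch in which a dashboard is nothing more than a text artifact containing a SQL model. Everything in it is an assumption made for illustration: the `dashboard_spec` fields, the `ad_events` table, and the `render` helper are hypothetical and are not Rill’s actual file format or API. SQLite from the Python standard library stands in for the analytics engine.

```python
import sqlite3

# Hypothetical declarative artifact. The field names are invented for
# illustration and are NOT Rill's actual file format.
dashboard_spec = {
    "title": "Impressions by publisher",
    "model_sql": """
        SELECT publisher, SUM(impressions) AS total_impressions
        FROM ad_events
        GROUP BY publisher
        ORDER BY total_impressions DESC
    """,
}


def render(spec, connection):
    """Run the spec's SQL model and return (title, result rows)."""
    rows = connection.execute(spec["model_sql"]).fetchall()
    return spec["title"], rows


# In-memory SQLite database standing in for the analytics engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ad_events (publisher TEXT, impressions INTEGER)")
conn.executemany(
    "INSERT INTO ad_events VALUES (?, ?)",
    [("cnn.com", 3), ("bbc.co.uk", 5), ("cnn.com", 4)],
)

title, rows = render(dashboard_spec, conn)
print(title)
for publisher, total in rows:
    print(publisher, total)
```

Because the whole definition is plain text, it can be diffed, reviewed, and version controlled, and a copilot can propose a change by editing the SQL string rather than by driving a point-and-click interface.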

Kostas Pardalis 39:23
That’s super interesting, actually. And very good points about what’s going on with copilots and the AI situation right now, and how they work well with code-first interfaces instead of drag and drop, which I never thought about. That’s very interesting. Okay, so let’s talk a little bit about the present and the future of BI, and I’ll take you a little bit back into the past. BI has already gone through some kind of a cycle. Around 2015 or 2016, I would say, we had Looker, we had Sisense, we had Periscope Data, we had Mode Analytics, we had Chartio. We had all these different BI tools, some of them targeting different personas and trying to differentiate based on that. But what eventually happened, from what it seems, is that the peak of that cycle was the acquisition of Looker by Google, which I think was also the biggest outcome in the space, and then things got a little bit, let’s say, not that exciting anymore. Right? We’ve had some mergers happening, like Sisense and Periscope Data. I think Mode now also got acquired by another company, if that’s correct. And it’s not very clear where the cycle ended, and whether there is a new cycle of innovation. What is going to happen with BI, and what’s happening with BI in general? So tell us a little bit about that: what happened in the previous cycle, in your opinion, and what’s next?

Mike Driscoll 41:21
Great question. In a lot of ways, I think the cycles of BI, and the different spin-outs and mergers and consolidations and acquisitions, are similar to the world of databases. One might ask, when you see new database companies being started in the last few years, gosh, do we really need another database? My broad view on cycles of BI is that the world of data is so massive, and so critical to the global economy and to every business, that in the same way that we don’t have just one type of manufacturing company that makes atoms, there’s really not a lot of grand uniformity when it comes to manufacturing bits, whether it’s ETL or databases or exploratory business intelligence tools. So my first comment is that I don’t see, anytime soon, a massive consolidation around one database or one BI tool to rule them all. I think the world is far too heterogeneous in terms of its problems for that to be the case. But as far as the current cycle in BI goes, I would argue that there are maybe three generations of BI we can really point to, and we’re kind of in the third generation here. I think the first generation was desktop and server-based BI. So I think of Power BI as an early business intelligence tool. Years ago, Oracle had a BI tool that they shipped, you had SAP, you had a number of what I would call old-school, 1990s companies that were shipping desktop BI tools or beefy server BI tools. Qlik was in that category as well. And many of them had, as I mentioned before, embedded database engines that they came with. And that was generation one. 
And that worked pretty well, I think, for the nature of enterprise architecture then. But I think the big shift that occurred, and frankly Looker, I think, heralded this era, was the shift to cloud BI. Looker was one of the first companies to really embrace that they weren’t going to have an embedded engine. Looker was going to run on top of other databases; its semantic layer was just going to talk directly to a cloud data warehouse. So Looker grew very quickly, I think, because the cloud grew and people realized that was a better architecture. Ultimately, some of the legacy BI tools did embrace that architecture; Tableau allowed you to connect to remote data warehouses as well. But that was the second generation that I think we saw. And by the way, I think Mode and ThoughtSpot probably represent that second generation as well: Mode primarily talks to a remote data warehouse, and ThoughtSpot is increasingly about connecting to remote systems. I think now we’re in a potential third generation of BI. And what’s different now, as I mentioned when we were chatting before the show, is that I think the next big disruption in the data stack is going to be the commoditization of the cloud data warehouse as the source of truth for company data. More and more companies are embracing object stores like S3 and GCS and Azure object storage. More and more companies are embracing structured data on object stores as their core foundational data layer. And as that happens, I think we need a new generation of data applications that can connect directly to the object stores and not just rely on the data warehouse like Looker did. And so that, frankly, is where we’re making a bet at Rill. We’re making enormous investments in support for things like Delta Lake. 
And Apache Iceberg, along with the commercial support for it from Tabular. And I think there’s a lot of exciting stuff to be done with that new architecture. So as we move through these three generations, we go from server architectures for BI to cloud warehouse architectures for BI, and I think we’re now in the era of object store architectures for BI. And I think there’s a lot of innovation that can be done in this new data architecture.

Kostas Pardalis 47:00
That’s super interesting. So in this new model that we are talking about, how do all these pieces fit together? We have data warehouses, we have data lakes, we have BI tools that have their own engines, right? We have the more real-time systems like Pinot, or Druid and ClickHouse. How do these things fit together? Do they overlap, or are there some clear boundaries there? Or does a company have to grow in order to start considering some of these other technologies?

Mike Driscoll 47:51
Well, again, I think it’s still early days, but here is my own view on how these pieces may fit together, some broad thoughts. First of all, I think that all data will ultimately live in the data lake, and will ultimately live as Parquet or, and I know one of your guests is going to be the creator of LanceDB, in some other structured format. All structured data will live in a structured data lake in an object store. In the future, I think that will be the governing lowest common denominator of data across most organizations. And so that means that all data-producing and data-consuming systems will go through that foundational object store fabric. I think Microsoft actually got it right when they talk about their Fabric architecture. In my opinion, only in fairly rare use cases would you want to consume directly from Kafka. If you look at even what the folks at WarpStream Labs are doing, they’re using Kafka backed by an object store. It’s serverless Kafka. So again, I think all data technologies and data services will write to and read from the object store. And that does simplify things in a lot of ways. Having that fabric there, you then just have different requirements for the different styles of data applications you’d want to power off of that data. For business intelligence applications, it’s really important that things are fast. And the only way to make sure that things are fast is to have your compute and your storage co-located in some way. So you have two choices: you can either move the data to the compute, or you can move the compute to the data. Both of those, I think, are acceptable. 
I think a tool like DuckDB is very powerful because it allows you to move compute to the data. You can spin up a Lambda job and stick a DuckDB in it, and you can run that compute very close to the data, in the correct region, where you have fast access to the object store. In Rill’s case, we decided to actually orchestrate data out of the object store, aggregate it, and move it to our compute nodes. But I think co-location of data and computation is a key piece, I would say. In general, though, other workloads don’t need that. For a lot of reporting workloads, one of the challenges we see today is that people are constantly moving data between data systems. One of the advantages of having everything in the object store is that you don’t need to do that migration. So I think, reluctantly, Snowflake and several other tools have embraced the Iceberg format, and I think we’ll see its adoption continue to expand. The idea there is that for asynchronous workloads, you don’t need to move data into Snowflake to query it with Snowflake. You can query an external table from Snowflake and not have to do an ETL job and move terabytes of data into Snowflake in its proprietary format. So yeah, I think a lot of it depends on the nature of the workloads. But increasingly, I think we’ll see a lot of in situ data applications that operate on the data effectively sitting in the object store. And that’s, I think, a huge efficiency gain for that style of architecture versus a lot of the systems today, where you have a lot of data moving around.
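
The two choices described above, moving the data to the compute or moving the compute to the data, can be sketched in a few lines. This is a toy model under stated assumptions: the dict below is a hypothetical stand-in for files in an object store, the rows are made up, and the "shipped" counts only illustrate how much has to cross the network in each approach. It is not how DuckDB, Rill, or any particular engine is implemented.

```python
from collections import Counter

# Hypothetical stand-in for files in an object store:
# each "file" is a list of (country, clicks) rows.
OBJECT_STORE = {
    "events/part-0": [("US", 1), ("US", 2), ("US", 3), ("DE", 5)],
    "events/part-1": [("US", 4), ("DE", 1), ("DE", 2), ("DE", 3)],
}


def move_data_to_compute(store):
    """Ship every raw row to the compute node, then aggregate there."""
    shipped = [row for rows in store.values() for row in rows]
    totals = Counter()
    for country, clicks in shipped:
        totals[country] += clicks
    return dict(totals), len(shipped)


def move_compute_to_data(store):
    """Aggregate next to each file, ship only partial totals, then merge."""
    partials = []
    for rows in store.values():
        partial = Counter()
        for country, clicks in rows:
            partial[country] += clicks
        partials.append(partial)
    shipped = sum(len(partial) for partial in partials)
    totals = Counter()
    for partial in partials:
        totals.update(partial)  # Counter.update adds counts together
    return dict(totals), shipped


if __name__ == "__main__":
    print(move_data_to_compute(OBJECT_STORE))
    print(move_compute_to_data(OBJECT_STORE))
```

Both functions return the same totals; the difference is only in how many records leave the storage side, which is why aggregating close to the data helps when the raw data is large and the result is small.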

Kostas Pardalis 51:52
Yeah, that makes sense. All right, one last question from me, and then I’ll give the mic back to Eric, because we are close to the end here. One of the things that has happened in the past two years that is rapidly changing the data space, I think, is AI, right? And I think BI tools especially have been very good at embracing that, for very obvious reasons. As you said, text being the API is a very strong concept there. But my feeling is that there are probably much deeper things happening with AI, and that it will change the way we work with data. So what’s your take on that? How do you see BI being affected by AI? And what’s next there?

Mike Driscoll 52:44
Well, I’d say there are maybe three consequences that I can think of. I once commented, thinking about AI, that we’ll know we have AGI once we have solved for data engineers not having to write regular expressions on their own. I think one of the first and highest uses of AI is actually data wrangling. We all know that we practitioners in data spend far too much time writing regular expressions and parsing data. I think tremendous benefits will emerge on that front. Through things like Copilot, I think we can dramatically reduce the pain around data munging with AI. The second thing I would say is in terms of its impact on the languages that data practitioners use. As I said before, AI is obviously code-based today, primarily prompt-based. And I think we will see a lot of people who have been trying to create new languages for data transformation. I applaud those efforts. We all need new languages, always. But I think it’s still early days for SQL being adopted not just for querying data, but increasingly for transformation of data, ETL, and data modeling. And so I think that AI is going to further propel SQL, just because it’s such a lingua franca. There’s so much for these large language models to learn from, in terms of the massive corpus of blog posts and Stack Overflow answers that use SQL to manipulate data. So I think AI will actually propel SQL to even greater dominance as the lingua franca for all data work. And then I would say the third consequence of AI in the data space is solving the cold start problem. I think AI is great at generating a scaffold of something that an analyst can then edit, versus having to create it from whole cloth. And so in particular, the area where I think AI has great potential for data work, and we’ve seen this already with OpenAI’s analytics module, is visualization. 
A lot of people spend a lot of time pushing pixels when it comes to building data visualizations, to make their data look pretty. Being able to go from a data set to an informative, useful visualization of that data set, or generating eight or ten different possible visualizations of a particular data set, is something where I think AI has great potential to aid in that somewhat creative task that not all analysts are great at. So those are the three areas, I would say: helping out with ETL, propelling the dominance of SQL, and providing a path to beautiful data visualization without a lot of effort.

Kostas Pardalis 56:18
All right, so I have plenty more questions, but I think we’ll have to reserve them for another episode. Eric, all yours.

Eric Dodds 56:28
Yeah, well, we’re right here at the buzzer. But, Michael, what a fascinating conversation, and you have had such a long and fascinating career in data. But I have to know, since we’ve talked so much about data on the show: if you couldn’t work in data or technology, what would you do?

Mike Driscoll 56:47
If I couldn’t work in data or technology, what would I do? I think my secret dream when I was in college was to be a scriptwriter. I wrote a stand-up comedy show when I was in college. And so I would say if I were not working in data, I would probably be, yeah, maybe trying to work in Hollywood, writing bad jokes for late-night TV.

Eric Dodds 57:21
I love it. That’s so fun. Or write for Saturday Night Live. Yeah.

Mike Driscoll 57:26
I don’t know if I’m funny enough for that. But it certainly would be a fun job, even if I knew I might not be the best at it. But yeah, that was my alternative dream.

Eric Dodds 57:38
I love it. Well, Michael, thank you again for sharing your time with us today. We learned so much and best of luck as you continue working on Rill Data.

Mike Driscoll 57:48
Thank you, Eric and Kostas. Thanks for having me. And I look forward to meeting up in person sometime here in the Bay Area. Thanks, guys.

Eric Dodds 57:57
We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.
