This week on The Data Stack Show, Eric and Kostas chat with Arjun Narayan, co-founder and CEO at Materialize. Materialize is a streaming database for real-time applications and analytics that lets users get answers to complex analytics queries in real time on top of streams of data.
Highlights from this week’s episode include:
The Data Stack Show is a weekly podcast powered by RudderStack. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Eric Dodds 00:06
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.
Eric Dodds 00:27
Welcome back to the show. Today we get to talk with Arjun, the founder of a company building a database product called Materialize, and I’m super interested to talk to him. As I think about our audience, Kostas, the biggest question that comes to mind is: what are the immediate use cases for a tool like Materialize that, at a foundational level, can take jobs with data that are generally considered batch and happen over a long period of time with a lot of latency, and essentially turn them into real-time jobs? Analytics is absolutely a use case that I think makes a ton of sense. But I’m sure that people are doing all sorts of other interesting things. So that’s going to be my big question. As far as use cases, analytics is obvious, but what else can you do when you go from batch to real time in the context of a database? Kostas, you love Materialize and I cannot wait to hear what your burning questions are.
Kostas Pardalis 01:28
Yeah, yeah, I mean, okay, first of all, what you have in mind is, I think, a great question. Materialize is a very, let’s say, novel way of interacting with data and consuming data. So it’s very interesting to see what people are doing with it. So absolutely, I’m really looking forward to hearing about the use cases. I have a lot of questions myself, to be honest, and I don’t know how much we’ll manage to cover today; most of them are going to be technical. I want to learn more about the technology, like the secret sauce behind Materialize as a database, and also, apart from the technology, it’s a very interesting product: the ergonomics that this database has are very, very interesting. So I’ll have quite a few different questions that will help us better understand the technology behind it, and also some choices that the team has made in building this new database system.
Eric Dodds 02:25
Well, let’s jump in and talk with Arjun.
Kostas Pardalis 02:27
Let’s do it.
Eric Dodds 02:29
Arjun, welcome to The Data Stack Show. We are very excited to talk with you because there are just so many data topics that we could cover in this conversation. And we probably won’t have time to get through all of them. But welcome.
Arjun Narayan 02:45
Thank you very much. I’m excited to be on the show.
Eric Dodds 02:48
Let’s just start like we always do: we’d love to know your background.
Arjun Narayan 02:51
I’m Arjun Narayan. I’m the co-founder and CEO of Materialize. Materialize is a streaming database for real-time applications and analytics. It allows you to get extremely complicated and complex analytics answers in real time on top of streams of data, as opposed to once a day on top of batch data. It looks and feels exactly like a SQL database. I started Materialize a little over two, maybe two and a half, years ago. Before that I worked in a different field of databases. I was a software engineer at Cockroach Labs working on CockroachDB, which is an OLTP, scale-out, horizontally scalable database. And before that I did a PhD in distributed systems and big data processing. I’ve sort of lived, breathed, and been in data for a while, a little bit by accident. I didn’t intend to fall in love with databases, but as I learned more and more about how they power most of the applications and experiences we have with computers, they just became endlessly fascinating to me, and I’ve spent a decade looking at databases at this point.
Eric Dodds 04:08
I love that. With a PhD in anything related to databases, I think that you have a lot of technical acumen, and I love the sentence: I didn’t mean to fall in love with databases. I feel like that’s the beginning of a novel … that may have a very specific readership. Okay, Materialize is super interesting. I think a lot of our audience is very familiar with working sort of in and around your traditional database and data warehouse, right. So Postgres, and the usual suspects when it comes to data warehouses: Redshift, BigQuery, and Snowflake, which is obviously taking over the market. And there are really common paradigms within which you can run SQL, you can create views, etc. The syntax is a little bit different depending on the warehouse, but for our average listener, let’s just take an example: they are a data engineer, they do a lot of work getting data into Snowflake, they create views, they create different use cases for analytics teams, etc. For that person who may not be familiar with Materialize, could you just paint a picture of what it looks like if you introduce Materialize into the stack? And what are the key benefits that it brings?
Arjun Narayan 05:28
That’s a great question. I think it helps to break down the standard paradigm of where most databases fit in, in the traditional worldview, and then we’ll introduce how Materialize brings some new capability that’s different from what’s currently in the market. So databases, and this is going back, say, several decades at this point, traditionally fall into two large buckets: the transactional databases and the analytics databases. Transactional databases are your Oracle, your Postgres, your MySQL. They’re, generally speaking, focused on processing lots of transactions that may potentially be conflicting. They’re sort of the point that decides what events are allowed to happen. So they reject some transactions, they accept some other ones. And then they’re very good at writing those transactions down. They’re very focused on avoiding data loss. That’s something you really, really want from your transactional database. Then you have your analytics databases: your BigQuery, your Redshift, your Snowflake. Your analytics databases are more focused on enabling far more powerful compute. Typically, people use SQL in both settings, in the transactional setting and the analytical setting. But if you take some of these complex queries, say joining eight tables together, where some of these tables are very, very large, and you ran them on a transactional database, the transactional database would A) most likely fall apart, and B) if it didn’t fall apart, it would probably greatly slow down your other concurrent transactions. So there’s a reason people mostly separate these systems. If an analyst types some large analytical query about last quarter’s sales, you don’t want all your cart checkouts to triple in latency, right? So it makes perfect architectural sense to separate these concerns, and then also build separate systems that are optimized for these different classes of workloads. The big thing that most people give up today is that your analytics query runs on a dump of the data that is somewhat stale. So this is feeding your batch data warehouse with a once-a-day ETL. I mean, ETL, extract transform load, is really about getting data out of the transactional system and putting it in the analytic system. Traditionally it’s … it’s getting less painful, but it used to be an extremely painful process: you would run it overnight, once a day. Some folks are now running this multiple times a day, but it is still fundamentally a batch operation, which means there’s a large class of analytical queries which fundamentally are always going to be somewhat out of date, right? And you can bring that down to being just an hour out of date. But the missing opportunity that we noticed was that there’s a class of analytics or analytical-style queries that are incredibly valuable to have in real time, which don’t make sense on a transactional database, but existing analytical databases or data warehouses are not equipped to do them, because they’re fundamentally built in this batch paradigm. Materialize flips the setting a little bit: instead of computing your answer from scratch off of a data set when the query is presented to it, it pre-materializes a set of questions that you’ve pre-registered with Materialize, and this is why the company has been named Materialize.
So you might be familiar with the term materialized views. The entire point of a materialized view is you tell the database, hey, I’m interested in asking this question on a repeated basis. Can you please pre-compute it for me as the data changes? In the past, materialized view support in most databases has been highly restricted, right? So you can do it for fairly simple queries. But if the query gets fairly complex, the database really wants you to just ask it, and then it’ll go ahead and do the work, rather than doing a whole bunch of redundant work that has to be immediately thrown away the moment the data changes. So under the hood, Materialize is an incremental query processor, and we can talk a little bit more about the technology, because this is the thing: I don’t think I’m describing anything that people haven’t wanted for a very long time.
Arjun Narayan 09:53
The unique thing that we bring is a novel set of underlying research and technologies that allow this to happen in an elegant fashion. Materialize allows you to ask these complex analytical queries at sub-second, say, single-digit millisecond, latency, even when these queries are very, very complex. This is more than just about taking some analytics query that you asked once a day and making a live dashboard. Now, absolutely, a lot of our users start by taking something that they computed once a day, some very valuable metric, and making that into a dashboard, so they can see it on a more real-time basis, especially in, say, the financial and the trading use cases. They can never have these things fast enough, right? But the more interesting thing happens when you start to take this live, changing data and take automated actions off of it. So you could think alerting, you could think personalization in an application as you get real-time data, as opposed to realizing the next day, by the time your OLAP job finished, that a customer should be segmented a certain way, and then doing an email marketing campaign. There’s a wide variety of uses where you can take action while users are on your website, or while a transaction is still pending, before it has been authorized or declined, if it’s a card transaction. It’s much more valuable to make a precise judgment as to the quality of that user or that transaction within, say, a 10-100 millisecond budget versus doing that overnight and reacting to it the next day.
Eric Dodds 11:36
Absolutely, I mean, this is fascinating. And we’ve had several conversations with different businesses where this is where they’re heading with their architecture. And e-commerce comes to mind just because it’s a situation where you have a lot of data, a lot of it needs to be enriched or combined with other data, right? So data from transactions or ML models, and all of that’s happening in some sort of database. And the challenge has been, we’re creating all this value of the data that we have. And it’s very difficult to deliver that with speed, right? And e-commerce, if you want to send a personalized coupon right after purchase or something like that, that needs to happen very quickly. But the latency has been really high just due to technology. But that’s changing. And that’s really, really exciting. So super, super interesting.
Arjun Narayan 12:33
Absolutely. One of the things that we see is the number of folks who are putting in place the capabilities, and we’re very much in the early stages of this architectural transformation, because folks are pretty much just putting in place the streaming infrastructure to move the data at low latencies and at high volumes, right? So this is doing change data capture out of their transactional databases on an ongoing basis, so that milliseconds after a transaction commits in Postgres or MySQL, it is present in a Kafka topic that can be used by these downstream consumers or applications. And the early adopters have gone ahead and built these manual micro-services, right? So the absolute earliest adopters have adopted this microservice pattern, which comes at a huge cost: not to mention just the development cost of building these manual micro-services, but the ongoing maintenance and upkeep costs that these micro-services introduce when you want to, say, change a little bit of business logic. So changing business logic sometimes takes a full quarter, because you have to shut down or upgrade these microservices in a controlled fashion. And perhaps something that would be very simple in a database, like joining against another stream, ends up introducing a massive amount of architectural shift, as you now have to build and manually maintain an extra set of state that is introduced by adding on that third topic. So these are the sorts of costs that people currently pay that we want to reduce. We think that building these streaming micro-services, streaming applications, right on top of the stream should be as easy as building a CRUD app using a MySQL database. Today it’s not, but with Materialize it is.
Eric Dodds 14:22
Yeah, well, I want to dig into some of the technical details, because there are a lot of questions that Kostas and I talked about, but before we get there, you mentioned something around moving beyond the basic analytics use case, and that’s something I just want to talk about briefly. People use the term digital transformation, which is a buzzword, but on the spectrum of digital transformation, you have companies who have figured out the analytics thing and are moving towards more interesting use cases. But there are a huge number of companies where their analytics get refreshed every 24 hours, just because they’re relying on technology built around the batch-load paradigm. What are you seeing? I mean, there are a lot of companies who I think could just benefit from the analytics use case in and of itself. But the use cases that really move the needle are the ones where you’re actually delivering personalization or other really dynamic customer experiences. I would just love to know what you’re seeing as you talk with your customers and people who are interested in adopting something like Materialize. What’s the balance? Are a lot of companies still trying to figure out the analytics use case? Or are there more companies than we think who are actually doing some really interesting things around the customer experience?
Arjun Narayan 15:46
That’s an excellent question. To me a large part of this comes from where your analytics team is, right? One of the amazing things that has been happening in the industry is that analytics teams have become progressively more empowered to do more and more and create more value for their organizations, and now they are starting to get into building these applications, or building something that ends up being surfaced in the core application. The way I think about this is that analytics pretty much starts with a human in the loop, and then analytics starts really coming into its own once the analysts themselves are trying to figure out how to get themselves out of the loop, right, and how to make these things automated. So I think a lot of the analytics journey to real time and streaming begins with augmenting the human capability by giving them more live data. But where it truly comes into its own is when we start doing automated actions directly off that analytics pipeline. There’s a huge benefit to everyone in the organization, whether it’s the application or the analyst, speaking the same language in terms of defining the metrics that they’ve been thinking about in the exact same way. dbt is, of course, absolutely the leader in creating an ecosystem where an entire company’s or organization’s data is modeled using a single unified paradigm. And starting from the analyst, and then going towards the application, I think, is the correct way to do things. I absolutely encourage most folks to take their first steps by moving, say, a once-a-day refreshed dashboard into real time, because A) it’s an enabler of a lot more things, and B) it’s a good way to ensure that the application and the real-time in-application experiences are fundamentally based on the exact same vocabulary that is already part of the analytical organization.
Kostas Pardalis 17:48
Arjun, this is great. Actually, before I start asking my questions, I have to tell you, I really enjoyed your introduction. I think it was one of the best descriptions of the difference between the two database families that we have, which is a pretty common question: many people ask why we need to have both an analytics database and a transactional database. So that was amazing. If you haven’t written a blog post or something about that, please go and do it. I think many people are going to thank you for it. But I have a couple of, let’s say, a bit more technical questions I want to ask. And let’s start with the materialization. You mentioned that you also chose the name because of the concept of materialized views. Why would someone use Materialize and not just keep using the materialized views that a transactional database, for example, offers?
Arjun Narayan 18:44
Excellent, well, thank you so much, Kostas, I appreciate it. I should write a blog post. This is a great question in terms of why not just use the materialized view in, say, Postgres or MySQL. Well, the first answer is, if your materialized view becomes the slightest bit complicated, you will lose the ability to incrementally update it. So it’s really about what the update strategy for this materialized view is, because for a complex materialized view, let’s say you’re joining four tables together, you have some subquery in there, you have some non-trivial aggregation, maybe some max and some group by or something of that sort. The first thing an OLTP, or even an OLAP, database is going to tell you is: you have to manually tell me when to refresh the materialized view. And then when you do that, I will essentially run the equivalent of a select query and then stash the result in a table for you to query. It gains you almost nothing compared to repeatedly issuing select queries.
Arjun Narayan 19:47
The hard part, the technologically hard part, is the reuse of previously computed results to efficiently update the materialized view. A good way to think about it is you want to do work proportional to the changes, not proportional to the query load. So if somebody asks a select query, and very little has changed, you shouldn’t force your database to do a massive quantity of work. If data has changed but does not affect the result, you want that to essentially be suppressed as early as possible. So a good example of this is if I’m summing a bunch of rows, and then somebody added a bunch of zeros, we should quickly detect that, and not throw all our results out and re-compute everything from scratch.
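To make that concrete, here is a minimal sketch in Rust of the idea of doing work proportional to the changes. It is purely illustrative and not how Materialize is implemented internally: the maintained “view” is just a running sum, updates arrive as signed deltas, and deltas that cannot affect the result (the “bunch of zeros” case) are suppressed immediately.

```rust
// A toy incrementally maintained "view": a running sum kept up to date
// by applying only the changes, never re-scanning the full history.
struct IncrementalSum {
    total: i64,
}

impl IncrementalSum {
    fn new() -> Self {
        IncrementalSum { total: 0 }
    }

    // Each update is a signed delta (an insert contributes +value,
    // a delete contributes -value). A delta of zero cannot change the
    // result, so it is dropped without touching the maintained state.
    fn apply(&mut self, delta: i64) {
        if delta == 0 {
            return; // the "bunch of zeros" case: suppressed immediately
        }
        self.total += delta;
    }

    // Reading the view is O(1), independent of how much data exists.
    fn read(&self) -> i64 {
        self.total
    }
}

fn main() {
    let mut view = IncrementalSum::new();
    for delta in [5, 0, 0, -2, 7] {
        view.apply(delta);
    }
    println!("maintained sum = {}", view.read()); // prints 10
}
```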
Arjun Narayan 20:35
A large amount of the analytics workloads that happen in data warehouses today are fundamentally redundant queries, where we are mostly recomputing the same answer. So if you have terabytes of data, most of this data is historical, right? Like, big data is absolutely real, but it’s primarily a phenomenon related to the amount of data we have collected over time. You don’t have big data every second; well, Google might, but for most organizations today, the amount of data that is coming in second by second is not that voluminous. But when your queries are fundamentally nonlinear, they’re joining a bunch of different things, the database sort of looks at it and goes, well, I don’t know what’s changed, I kind of have to just throw it all out and start over from scratch. And that’s fundamentally the paradigm that we want to get away from.
Kostas Pardalis 21:23
That’s great. Another question on that: why would I want incrementally updated views instead of having something like a caching layer and caching the results of a view?
Arjun Narayan 21:36
Well, the hard part is deciding when to invalidate your cache, right? So what you get from an incrementally updated materialized view is that this logic is handled correctly, perfectly, without the user having to do anything. One of the cute taglines we use internally is: think declaratively, but execute incrementally. So it allows you to still think in terms of what’s fundamentally the select query I’m trying to run, and then we think through all the hard parts of what data flow has to happen under the hood, which parts of it are stateful, which are stateless, which ones invalidate cache. If you’re building a micro-service, you’re gonna have to reason about all of this yourself and build a stateful microservice. And this is hard, and you might get it wrong. And if you get it wrong, it’s really subtle to debug, it’s difficult; generally speaking, most people use databases because inventing half a database that you happen to need for this particular use case is a risky thing to do and very hard to validate if you did it correctly.
Kostas Pardalis 22:41
So we also found a solution to one of the hardest problems in computer science, right? When to invalidate the cache. So that’s great.
Arjun Narayan 22:49
Yeah, exactly. It’s that and naming things.
Kostas Pardalis 22:53
Yeah. Yeah. All right. So what’s the secret sauce? What’s the magic? What is different in Materialize compared to, like, what Postgres is doing, which is, I don’t know, probably one of the most complex databases ever built. We’ve been building it for like the past 30 years or something, right? So what’s new and what is different with Materialize?
Arjun Narayan 23:16
That’s an excellent question. I don’t want to talk negatively about Postgres, so I’m going to flip the question: what does Postgres do that we can’t do? Postgres is a great OLTP database. In fact, we love it very much in the engineering team at Materialize, because Materialize speaks the Postgres wire protocol as closely as possible. So for an application that’s talking to Materialize, you use Postgres client drivers, you use the Postgres native language bindings, and it all just works. So we’re huge fans of Postgres. Postgres is a great OLTP database. What Postgres does very well that we don’t do is transaction isolation and concurrency control. So if you have, say, a unique index or a primary key field, and you have two people racing to commit transactions, Postgres will ensure that only one of them succeeds, right? It’s great at this conflict resolution and the consistency aspects of the ACID properties that you want from a database. What we’re very good at is computing these denormalizations, these complex views, and keeping them incrementally up to date. And we actually work very, very well downstream of Postgres. So one way that some of our users deploy Materialize is they have Materialize essentially acting as a read replica, right? So Materialize connects directly to Postgres, the transactions, all the writes, land in Postgres and then get immediately replicated, within a millisecond or a few, to Materialize, and then Materialize gets to maintain all these rich analytical indexes that are kept incrementally updated as soon as the data comes in. This way, the writes all flow to Postgres, and the complicated reads go to Materialize, which essentially offloads compute from Postgres. Now, how do we actually do this? Under the hood, Materialize is built on this state-of-the-art stream processing platform called timely dataflow. Timely dataflow was invented, or co-invented, by my co-founder, Frank McSherry, who has done a lot of stream processing research for, I think, coming up on 7-8 years now. Timely dataflow is a fully horizontally scalable stream processing framework, on which we’ve built query planning and dataflow planning such that we can take an arbitrary SQL statement, or SQL view definition, and convert it down into a persistent dataflow that is horizontally scaled out on this timely dataflow cluster.
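As a rough illustration of that wire compatibility, here is a small sketch in Rust using the stock postgres crate as the client driver. The connection parameters, table, view names, and column types are made up for the example, and the exact SQL a given Materialize version accepts may differ; the point is only that a plain Postgres driver can register a view once and then read the incrementally maintained result with ordinary SELECTs.

```rust
use postgres::{Client, NoTls};

fn main() -> Result<(), postgres::Error> {
    // Connect with an ordinary Postgres driver; host, port, and user
    // here are illustrative, not a real deployment.
    let mut client = Client::connect("host=localhost port=6875 user=materialize", NoTls)?;

    // Register the view once; the database keeps it up to date as new
    // data arrives on the hypothetical `orders` source.
    client.batch_execute(
        "CREATE MATERIALIZED VIEW order_totals AS
         SELECT customer_id, sum(amount) AS total
         FROM orders
         GROUP BY customer_id",
    )?;

    // Reads are plain SELECTs against the maintained result.
    for row in client.query("SELECT customer_id, total FROM order_totals", &[])? {
        let customer: i64 = row.get(0);
        let total: i64 = row.get(1);
        println!("customer {customer}: {total}");
    }
    Ok(())
}
```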
Arjun Narayan 25:50
There are some folks who use timely dataflow directly as a stream processing library. It’s an open source project. But most people don’t want to do this, right? Timely dataflow is written in Rust, and you don’t necessarily want to build and write Rust dataflows and manually orchestrate them. So we think there’s a large market for people who want the benefits of that incrementally updated, high-performance scale-out, blah, blah, blah, but who think about their computation the same way they’ve thought about computation for several decades, which is: they write and define SQL queries, and these SQL queries just stay alive, and they don’t really think about it. And these things just stay alive forever as the data changes.
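For anyone curious what using timely dataflow directly as a library looks like, here is a minimal sketch along the lines of the project’s introductory examples, assuming the timely crate as a dependency: it builds a tiny dataflow that turns a range of numbers into a stream and inspects each record as it flows past.

```rust
use timely::dataflow::operators::{ToStream, Inspect};

fn main() {
    // Build and run a small, single-worker timely dataflow.
    timely::example(|scope| {
        // Turn a static collection into a stream and observe each record.
        (0..10)
            .to_stream(scope)
            .inspect(|x| println!("seen: {:?}", x));
    });
}
```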
Kostas Pardalis 26:32
Hmm, that’s really interesting. And so how is timely dataflow different compared to other solutions out there, like Flink, Databricks, and the rest of the stream processing platforms that we have seen in the market until now?
Arjun Narayan 26:48
That’s great. So first off, I’m going to do sort of a bad job answering this, but there’s a wonderful research paper called “Naiad: A Timely Dataflow System,” which won several academic awards, that lays out the foundational case for timely dataflow and how it’s novel. There are a few things, not all of which we currently take advantage of in Materialize today. But a good example is that timely dataflow is capable of reasoning about cyclic dataflows, whereas most other dataflow models are purely acyclic. It is extremely expressive, almost to a fault. So driving timely dataflow around is hard, and something that we take a lot of pains to do correctly in the Materialize database layer. It is data parallel across a sharded dataflow graph in a way that most other dataflow engines are not. Today, in most dataflow systems, say Flink or Spark Streaming, the primary way in which they scale out to use many more compute resources is by taking various operators of the graph and placing them on dedicated CPU resources, and flowing data from one dataflow node to another. A good way to get intuition for this is: let’s say you have two sources of data, each of which has some map operation, then there’s a join operation, and then some subsequent map or filter or things like that. Each node from this graph of computation gets its own dedicated compute resource. Timely dataflow is sharded in a very, very different model that results in much higher performance, particularly in cases where you have very, very large dataflows. So let’s say you have a SQL query that has eight different input streams, complex subqueries, things like that; the actual execution graph of this may be hundreds of nodes. You as a user may not care, you just want that SQL to be incrementally updated. Getting that dataflow graph to high performance in some of these other stream processing systems is very, very hard. Whereas with timely dataflow, because of the way it scales up and has this sharded, cooperatively scheduled dataflow execution model, it is far, far more performant. For more details, I would point you to the research paper, because I’m struggling a little bit to convey some of the nuances without the references, diagrams, and some slides.
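To give a feel for that data-parallel, sharded model, here is a hedged sketch in Rust in the style of timely dataflow’s introductory examples: every worker runs the same dataflow, and the exchange operator routes each record to a worker based on a key, so the whole graph is sharded across workers rather than each operator being pinned to its own dedicated machine. The exact API may differ between versions of the timely crate.

```rust
use timely::dataflow::InputHandle;
use timely::dataflow::operators::{Input, Exchange, Inspect, Probe};

fn main() {
    // Start one or more workers (e.g. `-w 4` on the command line); each
    // worker constructs the same dataflow and owns a shard of the data.
    timely::execute_from_args(std::env::args(), |worker| {
        let index = worker.index();
        let mut input = InputHandle::new();

        // Every worker builds the same graph; `exchange` routes records
        // to workers by key, which is what shards the computation.
        let probe = worker.dataflow(|scope| {
            scope
                .input_from(&mut input)
                .exchange(|x| *x)
                .inspect(move |x| println!("worker {}: saw {}", index, x))
                .probe()
        });

        // Worker 0 feeds data; all workers step until each round is done.
        for round in 0..10u64 {
            if index == 0 {
                input.send(round);
            }
            input.advance_to(round + 1);
            while probe.less_than(input.time()) {
                worker.step();
            }
        }
    })
    .unwrap();
}
```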
Kostas Pardalis 29:18
Yeah, it makes sense, makes sense. I was aware of the Naiad paper and also the timely dataflow model. But I think it’s something that people out there, the broader community, are not that aware of. So I think the more we can communicate and talk about it, the better it is for everyone to start understanding and thinking in new terms, right? Because, as you said, timely dataflow is a different paradigm of how you can process data, and whenever we introduce a new paradigm, it takes a lot of repetition from the people who know about it and evangelize it to help the people out there understand it. It’s very interesting because we had an episode pretty recently with CockroachDB, and one of the topics that we were discussing was how important it is today for the engineers out there to start taking some elements from distributed computing and incorporating them in the way they think, as an engineer or as a developer, right? And I think this is one of the values that we, sitting here together and discussing interesting technical topics, can offer to our audience out there: some guidance of, you know, there’s a different way that data can be processed, maybe you should also start trying to think about this. Or, yeah, you might be a web developer or a front-end developer, but still, if you start thinking and using some of the patterns that come from distributed systems, it probably can help you with your work, and can also help you work much better with the backends that are probably distributed behind the scenes. So that’s why I always find it very, very valuable to discuss a little bit more of the technical details.
Arjun Narayan 31:13
Absolutely, I strongly agree. I think it’s very important for developers building and using systems like this to understand and appreciate what the right principles are, one, so they can choose the right technologies to work with, or the appropriate technologies for the problems that they’re solving. But one of the things we may struggle with, and I appreciate you pushing a little bit on this, is to what extent we should encapsulate and hide the complexity versus unwrap and show the complexity. So one of the big advantages of Materialize is you don’t have to know, you just write SQL. But there’s a sort of inherent tension where, you know, actually, A) everyone is interested and definitely wants to know, and B) maybe understanding will give you the right intuitions for what computation you can even execute, and how to go about choosing the right architecture to build, which systems you can incorporate and not incorporate in your architecture.
Kostas Pardalis 32:08
Absolutely, I totally agree. So Arjun, you mentioned that by incorporating this new timely dataflow processing model, Materialize manages to be very performant compared to the rest of the solutions out there for stream processing. What kind of resources should someone who wants to start using it today consider when setting up the open source version of Materialize?
Arjun Narayan 32:33
So we aim to make Materialize very simple to use. You go to our website or our GitHub, you click the download button, and you can run this on a single node, and you can scale up this node to handle … in fact, if you get a larger sized VM and you run Materialize on it, you can ingest a million messages a second, you can install dozens of views, and so on, before even needing to consider whether you need a multi-machine setup. In fact, you will be very productive on a single-node database before you need to graduate beyond this. We really go to great lengths to make it as easy to use as a database, right? So you run it on a single node, you connect to it using a SQL shell, or a SQL driver in your language. The lived experience is very much like Postgres, right? This is how most people run Postgres: they run brew install postgres, or apt-get install postgres, and they run it, and then it’s living in a VM by itself in a cloud for years of uptime. So that’s really the easiest way to get started.
Arjun Narayan 33:37
We are building a cloud service, which we are launching publicly next month, which allows folks to get even more advanced features. So one of the features that we will be shipping in our cloud product is horizontal scalability, for when you have these very, very large data volumes, well north of a million messages per second, for instance, and you do need multiple machines in a horizontally scaled setup to absorb that data volume. And then two, replication: if you have extremely high availability needs, you’re going to want multiple servers set up in an automatic failover capacity. And that’s something that our cloud product will, not next month, but down the road, also support.
Kostas Pardalis 34:20
That’s great. And I’m very excited to hear that you are launching a cloud version of the product, and I want to ask you more about this. But before we go there, because we are going to spend some time on it, I have a question that I don’t want to forget to ask. And that’s about something you mentioned at some point: that timely dataflow is implemented in Rust. So how did you decide to use Rust? What was the reason behind that?
Arjun Narayan 34:48
I think the original reason was Frank, when he started coding timely dataflow. He had recently left Microsoft Research and he had been coding for a while in that sort of .NET ecosystem, and he wanted to try something new. And Rust was a beta programming language at the time, a very risky thing. But he was just playing around. I think a lot of these open source projects start that way. So timely dataflow was coded in Rust. Now I think for highly data-intensive applications, the best choices are Rust or C++, because the manual memory management and control is quite important for a predictable low-latency experience. I think there are some places that have gotten good at writing in Go. Go is a garbage-collected language, not manual memory management.
Arjun Narayan 35:43
So I had some experience because I was a software engineer at Cockroach. CockroachDB is written in Go. We struggled with it a little bit. I don’t think it’s impossible. I think you can definitely, with enough sweat and effort, essentially drive the garbage collector around to do the kinds of things that you would have wanted to do in a manually managed environment. There are pluses and minuses. We doubled down on Rust when we built Materialize, because one of the things we could have done is we could have left timely dataflow as the underlying Rust engine layer, and then built the Materialize database management layer in a different language. And when we looked at that design decision, we thought about it a little bit, and we came to the conclusion that Rust was actually pretty great, and we were quite happy to build on Rust at all layers of the stack. So Materialize is 100% written in Rust, and we’re quite happy with that. I mean, I’m happy to go into more detail as to our experience building in Rust, and maybe contrasting a little bit with the Cockroach experience in Go as well.
Kostas Pardalis 36:39
Yeah, that’s very interesting. And I’m asking because Rust is a pretty young language, but it’s gaining a lot of traction lately. And it’s a very interesting language also from, let’s say, a research perspective, in terms of what kind of primitives they’ve added there in order to do this kind of memory management. It’s very interesting. And, of course, it’s very interesting to see that it is starting to be used for systems that get into production and in products that are delivered out there. So that’s why I was very interested to hear your opinion about Rust. And something that’s about Rust again, but from the perspective of being a founder and building teams: how easy is it today to find developers out there that can write in Rust, or who are willing to write in Rust?
Arjun Narayan 37:22
Right, so we don’t expect our engineers to know Rust when they join, although many of them do, certainly not all. We find that it takes a reasonable amount of time, on the order of a few months, to get productive in Rust. This is probably the biggest cost that we pay as an organization for building a product in Rust. There is a bit of a ramp-up time that we have to pay. But that’s fine. It is not difficult to find people who want to work in Rust. In fact, I would say it’s a significant attraction to several engineers who, maybe if they’ve written C++ code, have lost so many weeks of their life to chasing down some memory leak or some manual memory management bug, and they want to move to a language or an environment where they get the benefits of manual memory management, the performance, and they also don’t have to deal with that class of bugs. So we find quite a few people are very excited to work in Rust, although we do have to take some time to let them ramp up.
Kostas Pardalis 38:25
And what is the reason that it takes like a couple of months to start being productive in Rust? I think this is probably one of the main contrasts with Go, because one of the benefits that I hear, at least from engineers, about Go is that it doesn’t take that much time to be productive in it. So why does Rust take, like, five to six months to get productive in?
Arjun Narayan 38:52
I wouldn’t go so far as five to six. I think it’s more like two to three months, assuming we have an experienced software engineer who has been building backend or distributed systems, which is the mold pretty much all the engineers that we hire fit. And by the way, having worked in Go at Cockroach Labs, most people can be productive in Go in under one week. It’s a truly, incredibly concise language to get productive in. It’s sort of, I would almost say, optimized for productivity. The primary difficulty with Rust is that most folks have a little bit of an adversarial engagement with the compiler. It can be a little bit frustrating, because essentially what you’re doing when you’re writing a Rust program is you’re giving it sufficient type annotations that it is able to prove that certain classes of memory bugs are absent. So it’s a little bit like you are guiding a not very smart computer, because it’s not a human, to follow a proof, and there’s a little bit of: it’s too dumb to see that the code you’ve written does not have a memory leak. It is often called fighting the borrow checker. The borrow checker is the part of the compiler that yells at you. And there’s this standard failure mode of fighting the borrow checker for a while until you fully internalize the limited ways in which the borrow checker thinks. And then, you know, oh, this is where I should probably add this annotation, or do this thing, or use this pattern, in order to get past the compilation step.
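As a tiny, hypothetical illustration of fighting the borrow checker (not code from Materialize), the commented-out function below holds an immutable borrow of a vector while also trying to mutate it, which the compiler rejects; the usual fix is to narrow or reorder the borrows so they never overlap.

```rust
// A tiny illustration of "fighting the borrow checker": this function
// tries to hold an immutable borrow of the vector while also pushing
// to it, and the compiler rejects it.
//
// fn broken(values: &mut Vec<i32>) {
//     let first = &values[0];      // immutable borrow starts here
//     values.push(42);             // error: cannot borrow as mutable
//     println!("first = {first}"); // ...while the immutable borrow is alive
// }

// Once you internalize how the checker reasons, the fix is usually to
// copy the value out or end the first borrow before mutating.
fn fixed(values: &mut Vec<i32>) {
    let first = values[0];       // copy the value; no borrow is held
    values.push(42);             // now the mutable borrow is the only one
    println!("first = {first}");
}

fn main() {
    let mut values = vec![1, 2, 3];
    fixed(&mut values);
}
```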
Arjun Narayan 40:28
The other thing I didn’t mention, and this is a place where, given the novelty of Rust, I see a negative, is that there are not that many libraries and pre-existing tools that you can draw on from a rich open source ecosystem. It’s very different from Go. In Go, pretty much whenever you’re looking for compatibility with some driver, or some library, or some parsing library, or some security thing, it’s a very rich, mature ecosystem compared to Rust, where oftentimes we’ve had to write a library from scratch, whereas if we were writing in Go, we would have used an off-the-shelf one.
Kostas Pardalis 41:09
Yeah, makes sense. Although from my limited experience with Rust, I have to say that Cargo is a very nice experience for package management. So yeah, there are always trade-offs. It takes time for the community to build everything there, but with the traction it has, I think it will catch up pretty fast.
Arjun Narayan 41:29
For sure. And also, some of the things that I’m saying are not going to be downsides for people coming after, because there will be more software engineers who are already fluent in Rust. And hopefully we are a contributor as well, adding some of these libraries that we’ve open sourced, and other people as well. So a year from now, it’ll be even easier. These are just growing pains.
Kostas Pardalis 41:47
Yep, absolutely.
Eric Dodds 41:48
You’d asked about how to get started with Materialize. And I just wanted to jump in really quickly, because we talked about, obviously, the open source offering and then super exciting that you’re launching cloud. Arjun, one quick question. I’m just thinking about our audience here. What use case would you encourage them to start with?
Arjun Narayan 42:05
I think the simplest way to validate, with the lowest cost and the fastest time to value, would be to take some single report and move it into a real-time dashboard. You should be able to do this within a matter of a couple of days, and really validate that the technology is capable of taking arbitrary SQL that you have, business logic in your organization, and moving it to real time. And then that’s the position from which we can think through the more complex things, like actioning or integrating this into a pipeline that is part of an application experience. But getting this value in as short a time as possible is what I would encourage folks to do. So that pretty much means some pre-existing business logic or a pre-existing dbt model, since Materialize has a dbt plugin. You should be able to take your pre-existing dbt model and make it work on Materialize, ideally in a single day.
Eric Dodds 43:03
Oh, very cool. Wow, that’s an extremely fast time to value. And then just one more quick tactical question for our listeners: just go to Materialize.com to get notified about the launch of the cloud product?
Arjun Narayan 43:16
Yes, that will be front and center on our homepage. And in the meanwhile, you can download the source available free product from there as well.
Eric Dodds 43:24
Sure, great. Okay. Sorry, Kostas, I know we’re close to time, but I just constantly think about our listeners. And I love learning about new technologies, and I just want them to get the fastest way to understand how they can get in and kick the tires on it.
Kostas Pardalis 43:40
Yeah, absolutely. And it was very good that you asked this question, Eric, because it’s time to spend a little more time on the cloud version of Materialize. Can you share with us a few more details about the whole experience of trying to build a cloud offering for a product, or a framework, like Materialize? Things that you expected beforehand that didn’t happen, and things you didn’t expect but did happen? Anything interesting that you can share with us about this process of turning this amazing piece of technology into a cloud offering?
Arjun Narayan 44:22
Absolutely. The first thing I would say is the biggest reason why we’re building a cloud product: by far, as we talked to our users, to prospective users, to basically everyone in the industry, there is this wide consensus that everyone wants to use a managed cloud offering of pretty much all of the technologies that they use, because running and upgrading and manually maintaining these things is not something that most people are interested in doing, particularly as things get more and more mission critical. You’d much rather have somebody else carrying a pager than carry a pager yourself. The more mission critical this gets, the less you want to be in charge of carrying that pager when that system, you know, might go down. In terms of building a cloud service, one of the things that’s very exciting, and this is particularly true for companies like ours that are building this from day one knowing that the cloud product is the predominant way in which we are going to, you know, be successful as a business, is that you get to think in terms of atomic components that are cloud native. A very, very good example of this is separating storage from compute. Storage as an infinitely scalable, extremely low-cost service, namely S3 or the S3 equivalent on the other major clouds, with extremely high durability and extremely strong guarantees, is a building block that you can build, say, a database around. That means that there’s an entire class of problems that you don’t have to engineer for, namely data loss, or data corruption, or replication, or things like that. You can rely on this atomic unit of an S3 bucket being the principal storage layer for the vast majority of your data. And what this means is, of course, you get to use your engineering budget, instead of solving the same problems that everyone had to solve pre-cloud, to solve the new problems. Another one that you get is the ability to use other services that are cloud native, say for other components. A good example of this is going back to Postgres: Materialize Cloud uses highly available Postgres nodes under the hood for certain classes of metadata and things like that, whereas otherwise, if we were building a fully on-premise piece of software, getting this highly available would be a long engineering challenge. At the same time, we love users who just want to use the source-available product, or who want to deploy it on their own premises. The key distinction I would make is, we’ve designed Materialize Cloud such that the best place to get the highest number of nines of availability is Materialize Cloud. So things like active-active replication, automatic failover, load balancing, these are built using cloud native services, and owned and operated by us as part of Materialize Cloud, and are not part of the downloadable on-premise offering. And that’s because, fundamentally, these things are designed using cloud services that are not portable, right? Like, you can’t run S3 on your laptop, and yes, you can emulate it for testing, but that’s not how you would run a production service.
Kostas Pardalis 47:43
Absolutely, absolutely. Operating software and building software are two different things. So I have a question about the cloud offering compared to the experience that you described about Materialize from the beginning, and it has to do with latency. You said that Materialize is a system where you can expect, like, single-digit latency when it comes to the queries that you execute and the updates that you have. My intuition says that in order to achieve that, if I’m consuming data into Materialize from a database system that I have, I have to have my Materialize nodes as close as possible to my database. How can I do that when I use the cloud offering?
Arjun Narayan 48:31
So the first point I’d make is you’re absolutely correct, you want this to be very close to the database. But the other thing I’ll observe is most of the databases are in the cloud. So if you want to be close to the databases, you have to be in a cloud instance, by definition, to be close to the databases that are in the cloud. The important part of this is co-locating them as closely as possible, and then it usually comes down to region, availability zone, colocation, and things like that. You almost certainly don’t want to move this data across clouds, right? So our cloud service is launching next month on AWS, but eventually we want to fast follow to Azure and Google Cloud as well, because if your database is in one of these other clouds, you will have too much latency going between two clouds. The other thing I would say is the hyperscalers, the three big cloud companies, have gotten very good at laying extremely high-bandwidth, low-latency network connections. So as long as you’re in the same region, and spinning up your Materialize instance in a VM that is in the same region and perhaps even the same availability zone as your database, they’ve done a very good job making sure that the actual packets going across this virtual network will travel a fairly small physical distance.
Kostas Pardalis 49:51
That’s great. One last question from me, Arjun, and then I’ll give it to Eric. So we can also conclude this episode. You mentioned colocation, and all that stuff. And you also mentioned S3. So for the people out there who are interested in using the cloud version of Materialize when it’s launched, is this going to be on one cloud provider like AWS?
Arjun Narayan 50:18
Next month, we’re rolling it out on AWS, and then a few quarters later, we will be rolling it out on other clouds.
Kostas Pardalis 50:25
Okay, so people can expect that in the next couple of months, if they are a GCP shop, for example, Materialize will also be available there, and for Azure, and at least for the major cloud providers out there.
Arjun Narayan 50:37
I can’t commit to a specific timeline, but one thing I will say is that there is always the option of running Materialize, the downloadable source-available product, in a VM in an Azure region or data center.
Kostas Pardalis 50:50
That’s great. I think we need to have at least another episode, because I have more questions to ask. But I have completely monopolized this conversation. And I need to give at least some time to Eric.
Arjun Narayan 51:03
So this has been really fun. I really appreciate the questions Kostas.
Eric Dodds 51:07
This is great. I think we’re close to the buzzer. But we’ve talked about Materialize a lot as a team, Kostas and I, because we love discovering new technologies, and it really is a true joy just to get to talk with you and hear about the inner workings in many ways. And I hope this has been a really fun conversation for our listeners. Arjun, this has been such a wonderful conversation. We’ll definitely have to have you back on, and congrats on the cloud launch. That’s going to be great. I encourage all of our listeners to go to Materialize.com and check it out. And we’ll have you back on the show maybe in another six months or so after the cloud product’s been live to hear how it’s going.
Arjun Narayan 51:50
I would love to do that. This was an absolute pleasure of a conversation. Thank you both. Thank you, Eric. Thank you Kostas. This is a wonderful show you have over here.
Eric Dodds 51:58
Well, Kostas, I think one of the big takeaways I have and this won’t be my takeaway from the content of the show is that you and Arjun are incredibly intelligent when it comes to very deep concepts around databases and languages that you use to build technologies. And so it was a real joy for me to hear two very intelligent people reason around some of the decisions that they’re making. I think the big takeaway actually relates to my big question on the front end. Analytics is a really obvious use case. But all the other interesting things you can do, when you enable real-time, I think are just going to open up a lot of really creative solutions to problems that are low-level plumbing problems in the stack currently. And that’s very exciting. I mean, coming from a marketing background, I think about enriched profiles and automation and other things like that, and the ability to have this stuff in real-time from a database, I think it will actually be a very big driver of creativity in the way that people are building experiences.
Kostas Pardalis 53:07
Absolutely. You’re absolutely right. I mean, the closer you get to real-time, the more use cases you open up. And I think we are just at the beginning of seeing what people can come up with using technologies like Materialize. And I’m pretty sure that if we talk again with Arjun, like, in six months from now, he will probably have even more use cases to share with us. So yeah, absolutely. Materialize is a new technology, a new paradigm; there are many new, let’s say, patterns that we have to learn and understand and experiment with. It might take some time for people to figure out how to use it. But my feeling is that we are going to see very exciting things coming from this technology. I have to say, though, that Arjun is also an amazing, amazing speaker. He’s amazing at explaining really complex concepts. So I really enjoyed the conversation. I was really happy to hear about all the technology that they are using to build the Materialize product. And I’m also very excited to see what’s going to happen with the cloud version of the product. It was also very interesting for me to hear that, regardless of the technology that someone is building, how this technology is delivered and used is very important. And cloud is probably the best delivery model that we have at this point for this kind of product. So yeah, hopefully, in a couple of months from now, we’ll chat again with him and learn even more.
Eric Dodds 54:40
Yeah, absolutely. As I reflect on the conversation, I think a lot of really paradigm-shifting technologies take something extremely complex and make the experience very simple. And there are lots of examples of that. Being non-technical, but working with you closely enough to understand: when you talk about anything real-time related to a database, from a technical perspective, that’s an extremely complex problem to solve. And I think if Materialize can simplify that, I mean, that’s pretty paradigm-shifting. So it’ll be really fun. And I think if they can accomplish that, that’ll be huge. Awesome. Well, thank you for joining us on the show. Lots of really good episodes coming up this fall; we’re actually about to wrap up season two, so you’ll see that wrap-up coming up in the next couple of weeks. And then we have a great lineup for season three. And until then, we’ll catch you on the next one.
Eric Dodds 55:34
We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds at Eric@datastackshow.com. This show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.
To keep up to date with our future episodes, subscribe to our podcast on Apple, Spotify, Google, or the player of your choice.
Get a monthly newsletter from The Data Stack Show team with a TL;DR of the previous month’s shows, a sneak peek at upcoming episodes, and curated links from Eric, John, & show guests. Follow us on Substack.