Episode 123:

What Is a Universal Database? Featuring Stavros Papadopoulos of TileDB, Inc.

January 25, 2023

This week on The Data Stack Show, Eric and Kostas chat with Stavros Papadopoulos, Founder & CEO, TileDB, Inc. During the episode, Stavros discusses databases, particularly around solving problems at the storage level. The conversation also includes topics such as TileDB’s origin story, how the company focuses on efficiency in storage engines, how academia impacted Stavros’ journey, and more.

Notes:

Highlights from this week’s conversation include:

Stavros’ journey into data and founding TileDB (3:12)
What problem was TileDB going to solve? (12:05)
Defining database systems (21:35)
What part of database architecture is TileDB? (31:58)
Storage engine solutions (42:37)
What does the API look like in using TileDB? (50:40)
What makes genomics unique in working with data (55:28)
Final thoughts and takeaways (1:06:46)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 00:03
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com. Today we’re going to talk with Stavros from TileDB. He’s Greek, so I know you’re going to have a great conversation with him. And he created some really interesting technology. They call it a universal database, and I’m really interested to know what that means. So I’m going to ask: what is TileDB, and what is a universal database? My guess is that the technology is a little bit more specific and opinionated. He also spent a lot of time in academia, which is really interesting, and something we’ve talked a little bit about on the show before. So I really want to hear the story of how TileDB came about from his work at MIT. So yeah, those are my questions. How about you?

Kostas Pardalis 01:16
Yeah. First of all, I want to see how you’re going to handle two Greeks at the same time. Let’s see what the Socratic method is: we’ll keep asking questions, and we will never finish the recording. So that’s one thing. The other thing is, it’s a very interesting opportunity that we have here, because TileDB actually started by building one of the more low-level and core parts of a database system, which is the storage engine. So the innovation initially comes from that: how we can store the data at a very low level in a much more efficient way. That gives us the economics of dealing with the data. So it’s a great opportunity to focus on that and learn more about one aspect of database systems that we don’t usually get to discuss, because it’s more or less taken for granted every day, I think, when using these systems. So I think it would be very interesting to hear from him about how they do that. And also, what’s the story behind why they started from that, and why they open sourced the system: not just the origin of the technology, but also the story as a company.

Eric Dodds 02:51
Well, let’s dig in and talk with Stavros. Welcome to The Data Stack Show. I am so excited to talk about so many things, the primary one of which is databases, everything databases, and especially TileDB. So thanks for giving us some time.

Stavros Papadopoulos 03:10
Thank you for having me. I’m very happy to be here.

Eric Dodds 03:13
Okay, so give us your background. Because you have a long history in technical disciplines, but not necessarily in databases specifically. So where did you start? And then what was the path that led you to founding TileDB?

Stavros Papadopoulos 03:31
So TileDB is the first company I’m creating; I’m the founder and CEO. I have a PhD in computer science. I did my PhD in Hong Kong, where I spent several years, and then I became a professor, so I have a very deep academic background. It has always been data: not database systems, but data, a lot of algorithms and data structures. Then I did a lot of work on security and cryptography. Then, in 2014, I got an amazing job at Intel Labs and MIT, and that’s when I moved to Boston. That’s when I actually started working on TileDB, and effectively on database systems. That’s where I got more experience and dove deeper into the database systems world.

Eric Dodds 04:23
Super interesting. Can I ask what your thesis was on in your doctorate?

Stavros Papadopoulos 04:31
On result integrity, actually: authentication and integrity of query results. I was creating a variety of data structures infused with some cryptographic primitives, so that we can certify that the results returned for a query are indeed the correct results produced by the owner of the data. There are several different nice techniques for that. And I built several structures for geospatial data. That is where the geospatial side of TileDB comes from: marrying data structures with cryptographic primitives.

Eric Dodds 05:12
Yeah. Okay, I have another question for you, if you will entertain me, because I love thinking about the interaction of academic backgrounds and how they influence building commercial technology. Having a deep background in academics, a PhD in computer science, could you outline what you think are some of the top advantages and disadvantages, if there are any, of having an academic background? Because commercializing technology out of academia is hard, right? That’s difficult. I’d just love to know your experience, because you have it so viscerally, as someone with a deep academic background who has now started your first company.

Stavros Papadopoulos 06:02
Yeah, that’s a great question. So the business world is 100% different from the academic world.

Stavros Papadopoulos 06:12
A small piece of advice to an academically oriented person starting a business: start studying now. As I did, start with a lot of study, and surround yourself with amazing mentors from the business side of things. It’s a different world; it requires another PhD. Your computer science PhD is not going to help you there. No, you start from scratch. What really helps me a lot until today is, first of all, understanding what product to build, right? I’m very heavily involved in the decisions around the design, what features to build, and certain algorithmic aspects. I have an amazing team, but I’m very heavily involved in a lot of the core components of TileDB. I wrote the original code base of the storage engine. So I understand the code, and I do code reviews sometimes, believe it or not, for a specific feature that I’m very interested in. The biggest advantage is in defining the direction of the product and, of course, giving the vision, which, for a database company, requires you to understand the technology. And you need to understand all the surrounding technologies in order to differentiate, innovate, and so on and so forth. So that’s extremely important if you’re the founder and CEO of a company.

Stavros Papadopoulos 07:35
But the other advantage is when I get on a call with customers, because the customer can tell in the first 30 seconds that I know what the heck I’m talking about, you know, that I know how to solve the problem.

07:50
Like you’re not in vain, quickly.

Stavros Papadopoulos 07:53
They understand it, and it’s fully sincere, because I do get down into the details. If, for example, a customer has a new problem, which is not exactly a replica of the problems we have solved in the past, I offer solutions, and I brainstorm with customers and my team, and I actually enjoy it. This is one of the most enjoyable parts until today. And even at this scale, we are

Eric Dodds 08:16
still present. Yeah. Okay, one more question. And then I want to talk about databases. What is it like on the business side? As a CEO, you know, founder of a tech company? What’s the most sort of unexpected delight that you have in the business side that you maybe didn’t expect, right? Is it like managing people or, you know, fundraising? Or is there something where you’re like, Man, this is really fun. And I just didn’t really know this sort of existed on the business side.

Stavros Papadopoulos 08:57
I’m not sure exactly, but let me tell you a little bit about my experience on that, and if I am forgetting something, I apologize. Throughout the company, I never had any pleasant or unpleasant surprises, believe it or not. We are five and a half years into the company now, and I’ve never been truly surprised, because I was always prepared for each stage of the company. For example, when I was raising my very first round, I was studying how a convertible note is structured, so that I knew what kind of deal to cut, right? And of course, I was talking to a thousand angel investors and mentors in order to understand what was a good deal and what was not. As the company was scaling, I started studying more about culture, right? What kind of culture do we want the company to have? Because the first four or five people especially are going to dictate the culture, and then of course the first 10, the first 20, and so on. It’s not the first 1,000. Certainly the first small core of people will determine the cultural path of the company. And then, of course, when we started closing enterprise deals, I was learning about enterprise deals: the sales cycles, budgeting, procurement, all of that stuff you need to know in order to be very efficient. And the same goes for marketing. When we started doing marketing: how do we brand, what kind of traction channels are we going after, and so on and so forth. So I was never really surprised by anything. But what is extremely pleasant to me, again speaking as a scientist, not so much as a business person, is the results we’re achieving with customers. We’ve been extremely fortunate, and we can talk about this later, to focus on certain verticals that are extremely meaningful, for example, life sciences. And, you know, it is extremely meaningful when your system helps a hospital save the lives of babies in the NICU.
Okay, so yes, that would be the delight, as a company, right? And we continue to go after challenging and very meaningful use cases as our first niche, our beachhead markets. And of course, we expand once we do that, so that we have both purpose and sustainability. And

Eric Dodds 11:27
Yeah, I love it. I mean, I love hearing that. It sounds a little bit cliche from a business standpoint, but being customer obsessed is a key ingredient. And it’s very clear that you love digging in with the customers and understanding the problem, which is great. Okay, well, thank you. That’s such a wonderful background. Let’s talk about databases. Creating a database product is a huge undertaking. You’ve been doing that for five or six years now, and the database space has changed significantly, even in the last half decade or so. What problem were you thinking about when you started TileDB? And how did you decide that a database was the right way to solve it?

Stavros Papadopoulos 12:18
So when I started TileDB, I did not have in mind to create a company.

Stavros Papadopoulos 12:28
That was not the original motivation. I was a researcher, and I liked my job as a researcher. Intel was phenomenal, MIT was phenomenal. I was effectively embedded into MIT by Intel, so I wore both the industry and academia hats, which was amazing. I interacted with, you know, probably the smartest people on the planet. I loved this and, looking back, it was a good life. And

Eric Dodds 12:58
That is kind of a dream role, where you get the best of both worlds, you know, and you get to straddle it without undue burdens on either side.

Stavros Papadopoulos 13:11
Absolutely. And I would have continued doing it. It’s just one of those instances in your life where you need to choose. So, first of all, I started TileDB as a research project. I always wanted to have high quality in my work, in the sense that I didn’t want to just write one paper with TileDB. I had written multiple papers in the past. I wanted to write papers and innovate, obviously, but also do something that lasts. So I always had a product mentality, even during my research. Not a company mentality, because I was working at Intel. And the idea was, you know what, if this works well, maybe we create a database for the labs to maintain and grow, very similar to what CWI has done in the past, Berkeley, and others, right? So that’s the kind of mindset I had at the time. Let’s build something that is beautiful, do some technology transfer for Intel, try to solve some very big problems, and take it from there. So there were no company plans on the horizon. And, again, there was no specific problem from a use case standpoint, up until something came up in genomics. What I wanted to do was something different in databases. Also, please note that when I was at Intel and MIT, I was working at the intersection of high performance computing, supercomputing, and databases. The supercomputing came from Intel, where I was working with some true ninjas in high performance computing, optimizing operations down to the CPU cycle, bare metal and all that cool stuff. And the databases, of course, came from Stonebraker at CSAIL, right? So what I wanted to do from a research perspective was to bridge those two very different domains, with very different crowds that don’t talk to each other very frequently. And actually, that was one of the first times that happened, so kudos to Intel and MIT, because they partnered for exactly that.
The idea was to try to combine the knowledge of both domains. My group at Intel at the time was doing, and they still are, I still keep in touch, a lot of machine learning, AI, and graph algorithms. So, a lot of linear algebra, a lot of computational linear algebra, which sits at the core of very advanced operations. If you want to do something very advanced on your CPU or GPU, in all likelihood you’re using, somewhere in the depths of your software stack, some kind of package, probably built by Intel or AMD, that fully optimizes this computational linear algebra operation. And the question I had in the beginning, and this is how I started, was: wait a second. If we take the data from storage, be it key-values, or documents, or text files, or tables, or whatever, and then we wrangle this data into matrices and vectors in main memory in order to feed those into that optimized package on the CPU or GPU, why aren’t we storing the data in this form to begin with? That was the initial observation: why are we storing the data in anything other than the form that we compute on in the end? It started as a very naive observation. Then the observations started to pile up, of course, and I said, wait a second, we do store data as matrices. An image is a matrix. So now I see the means to also store an image natively as a matrix, and it can actually be sliceable. Instead of a blob that I bring in in its entirety, I can actually do some arithmetic in order to slice portions of it very fast. Now I’m starting to also model images in addition to tables. But if I do images and tables, what if I can also do key-values? What if I can also do graphs, representing these as adjacency matrices?
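To make the slicing remark concrete, here is a minimal sketch in plain Python. It is only an illustration of row-major offset arithmetic, not TileDB's on-disk format or API; the function names, the 4x6 "image", and the tiny graph are all made up for the example.

```python
# Illustrative only: a flat, row-major byte buffer standing in for a stored image.
# None of these names come from TileDB; they are hypothetical helpers.

def cell_offset(row, col, n_cols, cell_size=1):
    """Byte offset of cell (row, col) in a row-major 2D array."""
    return (row * n_cols + col) * cell_size


def read_slice(buf, n_cols, rows, cols):
    """Read the rectangle rows=[r0, r1) x cols=[c0, c1) from a flat buffer.

    Only the bytes that belong to the slice are touched; the rest of the
    buffer is never copied or decoded.
    """
    r0, r1 = rows
    c0, c1 = cols
    out = []
    for r in range(r0, r1):
        start = cell_offset(r, c0, n_cols)
        end = cell_offset(r, c1, n_cols)
        out.append(list(buf[start:end]))
    return out


# A 4x6 grayscale "image", one byte per pixel, stored as a flat buffer.
N_ROWS, N_COLS = 4, 6
image = bytes(range(N_ROWS * N_COLS))

# Slice rows 1..2 and columns 2..4 without reading the whole image.
patch = read_slice(image, N_COLS, rows=(1, 3), cols=(2, 5))

# The same 2D model covers a graph: an adjacency matrix is just another array.
edges = [(0, 1), (1, 2), (2, 0)]
adjacency = [[0] * 3 for _ in range(3)]
for src, dst in edges:
    adjacency[src][dst] = 1
```

The point of the sketch is that once the layout is known, a slice is a matter of computing offsets rather than parsing an opaque blob.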
This is how the observations started to pile up, again and again, more and more. And the hypothesis at the time in the lab was: is the multi-dimensional array a universal data structure? By universal, I mean both in terms of capturing all the different shapes of data, and not just generically, but in a performant way. So the hypothesis is twofold. Can we structure the data intuitively, in an abstract model? But the second part is equally important: can we do it efficiently? Because if you sacrifice performance, nobody gains. And that’s when I started pounding on the storage engine, because we start from storage. Is there an efficient format, on disk or on a cloud object store, which can represent data of any form extremely efficiently? Again, it started as a scientific hypothesis. And then I started thinking more business-wise. I was talking to a lot of people at the time, and I was working with the Broad Institute, which was across the street. They had a very specific population genomics problem, which at the time had not been thought of as a matrix storage problem or as an array problem. And when I modeled this problem as an array, I experienced a massive performance boost, and the model was also intuitive. At the time, there were three aspects of this data that we needed to index, which made it a perfect candidate for an array, and more specifically a sparse array. There is a difference between dense arrays and sparse arrays; we can talk about this later. But this is how it started. And the bottom line was that, after talking to a lot of potential prospects, we found that if indeed we build such a database system, which is universal, if we can do it, then we can tackle extremely important problems.
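The dense versus sparse distinction mentioned here can be sketched as follows. This is an assumption-laden toy, not TileDB's engine: the class name, the choice of (sample, genomic position) as the two dimensions, and the variant strings are all hypothetical. A dense array stores every cell; a sparse array stores only the non-empty ones, indexed by coordinates.

```python
from bisect import bisect_left, bisect_right


class SparseArray2D:
    """Toy sparse array: stores only non-empty cells, sorted by (row, col)."""

    def __init__(self):
        self._coords = []  # sorted list of (row, col) coordinates
        self._values = []  # values aligned with _coords

    def write(self, row, col, value):
        i = bisect_left(self._coords, (row, col))
        if i < len(self._coords) and self._coords[i] == (row, col):
            self._values[i] = value  # overwrite an existing cell
        else:
            self._coords.insert(i, (row, col))
            self._values.insert(i, value)

    def read(self, rows, cols):
        """Non-empty cells with rows[0] <= row <= rows[1], cols[0] <= col <= cols[1]."""
        lo = bisect_left(self._coords, (rows[0], cols[0]))
        hi = bisect_right(self._coords, (rows[1], cols[1]))
        return [
            (coord, value)
            for coord, value in zip(self._coords[lo:hi], self._values[lo:hi])
            if cols[0] <= coord[1] <= cols[1]
        ]

    def __len__(self):
        return len(self._coords)


# Hypothetical layout: row = sample index, col = genomic position.
variants = SparseArray2D()
variants.write(0, 1000, "A>T")
variants.write(0, 52000, "G>C")
variants.write(1, 999999, "T>A")
variants.write(2, 1500, "C>G")

# "Which variants do samples 0..2 have between positions 1000 and 2000?"
hits = variants.read(rows=(0, 2), cols=(1000, 2000))

# A dense array over the same domain would materialize every cell.
dense_cells = 3 * 1_000_000
stored_cells = len(variants)
```

A real engine would use multi-dimensional indexes and tiling rather than a single sorted list, but the storage trade-off is the same: four stored cells instead of three million.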
The value proposition of this database system is, first, performance for very complex use cases: genomics, LiDAR data and other kinds of point clouds, imaging, video, stuff that relational databases are not very good at. And number two, we can consolidate data and code assets in a single infrastructure. If we do that, we eliminate silos. If you eliminate silos, you first make people more productive, because they collaborate more easily. And second, they experience a lot of cost reduction: they don’t need to buy a lot of licenses, they don’t need to wrangle data, and they don’t need to spend time before they get to the science, and so on and so forth. So the business aspects came up later, from observation.

Eric Dodds 20:44
Yeah, I love it. Okay, well, I could keep asking questions, because I love the backstories. But I will summarize my takeaway on that. You hear about a lot of entrepreneurs trying to find pain in the market to respond to in order to start a company, and it’s so enjoyable to hear about your curiosity leading you to conclusions that ultimately solve some pretty specific problems. So it’s just so fun to hear about your curiosity being a guide from that standpoint. Okay.

Eric Dodds 21:27
Let’s dig into the technical side. Kostas, please take the mic, and help me learn about TileDB and all of the specifics.

Kostas Pardalis 21:36
Thank you, Eric. Thank you. So Stavros, let’s start from the basics. Okay. Let’s start with database systems. Before we get into the specifics of TileDB, I’d like to hear from you what database systems are, right? When we think about something like Postgres, or Snowflake, or BigQuery, it doesn’t matter, in the end all these systems have some common components. So I think it would be great to start from that: what are these components, the most universally found ones? And then we’ll get into more details about TileDB. I think that’s going to be super helpful for Eric, too.

Stavros Papadopoulos 22:19
So let’s do that. Oh, this is an amazing question, Kostas, because it gives me a segue to so many different things that I want to talk about. In order to answer this question while making no assumptions about the audience, and keeping it as simple as possible, let’s talk about what you would do if there was no database system in the market. And you’re going to be surprised: especially in the verticals we’re working with, right, life sciences, geospatial, those folks are not using databases at all. So let’s talk about what those folks are doing. In the absence of databases, the first thing that they’re doing is coming up with very obscure formats to store, to serialize, the data as bytes into files. Every domain, in the absence of a database system, is coming up with its own formats and its own parsers. You also need some kind of a command line tool to understand the format in the file, otherwise you’re not going to be able to open the file and read it. So in the absence of a database system, you as a user are responsible for serializing the data somehow, saying, I’m going to put this field first and then the next one, and then explaining that to the parser, or writing the parser, so that you’re able to parse this format. And what they usually do is form consortia to define the specifications of those formats, so that everybody agrees on them. And of course, changing the format, because the technology changes, makes this very problematic, because the technology advances much faster than the specification. So the specification stays behind, and you have no flexibility in changing it, because otherwise the downstream analysis tools, which rely on it, will never be able to work. And therefore, you’re stuck with files.
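As a small illustration of the "your own format, your own parser" situation described above, here is a sketch using Python's standard `struct` module. The record layout (sample id, position, quality) is invented for the example and does not correspond to any real genomics or geospatial format.

```python
import struct

# Hypothetical record layout "agreed on by the consortium":
# sample_id (uint32), position (uint64), quality (float32), little-endian.
RECORD_V1 = struct.Struct("<IQf")


def serialize(records):
    """Writer side: the user decides field order and encoding by hand."""
    return b"".join(RECORD_V1.pack(*r) for r in records)


def parse(blob):
    """Reader side: only works if it assumes exactly the same layout."""
    return [
        RECORD_V1.unpack_from(blob, offset)
        for offset in range(0, len(blob), RECORD_V1.size)
    ]


records = [(1, 1000, 30.5), (2, 52000, 12.25)]
round_trip = parse(serialize(records))

# The technology moves on: a later revision adds a one-byte flag per record.
RECORD_V2 = struct.Struct("<IQfB")
new_blob = b"".join(RECORD_V2.pack(*r, 1) for r in records)

# Every downstream tool built on the old parser now breaks (or, worse,
# silently misreads fields) until it is rewritten for the new layout.
try:
    parse(new_blob)
    old_parser_still_works = True
except struct.error:
    old_parser_still_works = False
```

This is the coupling a database system removes: the storage layout can change underneath a stable API without every reader being rewritten.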
Now on the cloud, with the advent of cloud, this becomes even worse. You need to store thousands or millions of files in some kind of storage bucket, index them with metadata, and maybe use a database just as a simple index: this file corresponds to that person. And speaking about access, you need to define your own solution for granting access to the data. So one part is the storage; the other part is who has access to the data. With a distributed file system, you’re going to call your IT person and say, hey, give access to this user, so that they can access that folder. And on AWS, it’s very similar: give me a role, so that I can have access to a prefix in a bucket. And, of course, this can create a revocation hell, because you need to keep track of all the keys and all the roles. Effectively, you’re creating a lot of work for yourself, first to store the data, and then to manage the data. And when it comes to the analysis, there are so many different tools that get created that reinvent the wheel. The reason is that all you share within a domain is a bunch of files. So if you want to do some kind of analysis on those files, you need to build a system; you need to build some kind of program to do that. And most of the programs have some common components. They mainly have to run some statistical analysis, like principal component analysis, or linear regression, or something like that. So every single tool either implements those or links to particular libraries. And again, there is a lot of reinvention of the wheel, because those tools share massive components on the analysis side. This happens today. Imagine if we did not have databases: we would do the same thing for tables. We would store the data in CSV files, or Parquet files, that’s fine.
And then every single tool would try to reinvent SQL operators, like a WHERE clause, a filter, projections, selections, and joins, and every single tool would have to implement a different one. So what is the database system? The database system abstracts all of that. It stores the data in some way. Yes, sure, there are some common formats like Parquet, but, you know, Oracle doesn’t expose to you the actual storage format. And in the past, nobody cared about the storage format. The database system cares about the format, and it evolves it the way it wants to evolve it, so that it becomes faster and faster, without asking anybody’s opinion about the format. So database systems have, again, the storage layer and a way to parse the data. They have an execution layer, with operators implementing very diverse functionality and computations. They have a language layer, some kind of APIs into the database, and the most common API is SQL. And, of course, a parser, to parse this and translate it into something that is actionable. All of those layers, everything that I explained, appear in every single database management system that exists in the world, be it tables for transactions, tables for analytics, key-value, graphs, images, and so forth. And that was exactly my observation: all of those layers are common. The only thing that systems differ on is the serialization of the bytes on the disk, and then some APIs; different systems use different languages and different APIs. And those are, you know, problems that can be solved. And the query engine is not as different as you think: you decompose a query into some kind of an operator tree or an operator graph, and then you can dispatch the execution. That was my observation.
And I said, okay, if that’s the case, if I choose the format wisely, and I abstract the operators wisely, so that it’s not just a WHERE clause, a filter, and a join, but I expand them so that it is also a matrix product, an inner product, and so forth, then maybe I can create something more general than just a SQL engine. So this is what the database system is. And that was the original observation: all the database systems in the world share those components. I’m sorry, I did forget a very important one: the access control layer, because I spoke about it.

Stavros Papadopoulos 29:41
This is super important. The system is responsible for enforcing authentication, access control, and logging. You don’t want to reinvent all that; they have to be part of the database system.
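The idea that a query decomposes into an operator tree, and that the operator set can be widened beyond relational operators, can be sketched like this. It is a toy model under stated assumptions: the class names (`Scan`, `Filter`, `Project`, `MatMul`) and the sample data are hypothetical and are not TileDB's execution engine.

```python
class Scan:
    """Leaf operator: produces rows (or matrix rows) from storage."""

    def __init__(self, rows):
        self.rows = rows

    def execute(self):
        return list(self.rows)


class Filter:
    """Relational selection, i.e. what a WHERE clause compiles to."""

    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def execute(self):
        return [row for row in self.child.execute() if self.predicate(row)]


class Project:
    """Relational projection: keep only the named columns."""

    def __init__(self, child, columns):
        self.child = child
        self.columns = columns

    def execute(self):
        return [{c: row[c] for c in self.columns} for row in self.child.execute()]


class MatMul:
    """A non-relational operator living in the same tree: matrix product."""

    def __init__(self, left, right):
        self.left = left
        self.right = right

    def execute(self):
        a, b = self.left.execute(), self.right.execute()
        return [
            [sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a
        ]


# Roughly: SELECT name FROM people WHERE age > 30
people = [
    {"name": "Ada", "age": 36},
    {"name": "Linus", "age": 28},
    {"name": "Grace", "age": 45},
]
sql_plan = Project(Filter(Scan(people), lambda r: r["age"] > 30), ["name"])
sql_result = sql_plan.execute()

# The same execute() protocol runs a linear algebra plan.
algebra_plan = MatMul(Scan([[1, 2], [3, 4]]), Scan([[5, 6], [7, 8]]))
algebra_result = algebra_plan.execute()
```

Because every node exposes the same `execute()` interface, the engine that dispatches the tree does not need to know whether a node is a filter or a matrix product.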

Kostas Pardalis 29:59
Absolutely. And I think we could probably have at least one episode for each one of these components, all the way to the parser. And there are some super fascinating things that I think even people who build databases forget. I always find it amazing that, if you think about SQL, right, it’s a language, and what it does is, you use it to describe what you want your data to look like. And then the database system takes this input to figure out what the data should look like, and actually generates code to go and create this data. I’m abstracting a lot here, because there can be many different ways to generate this code, right? But if you think about it, it’s a very interesting and very different approach compared to going and writing code to do the same thing on your own, right?

Stavros Papadopoulos 31:00
It’s worth thinking about this in order to see the absurdity of trying to build a database system on your own. Unless you’re making a company out of it, which is fine, because you’re going to raise capital and then build an amazing team and then do it for a living. But it’s absurd to try to do it in a hospital, in a genomics institution, or in a geospatial company. But this is what is happening; this is what they’re doing. It’s absurd for the following reason: just go to SIGMOD or VLDB. These are the conferences I used to hang out at in my academic life. You’re going to find professors, big professors, working on a subcomponent of the components I mentioned for their whole life, generating hundreds of papers with a lot of innovation. So you can understand how difficult it is to build such a database system.

Kostas Pardalis 31:59
100%. So, okay, based on this architecture that you described, the components of a database system, where is TileDB? What parts of this architecture does it fill out?

Stavros Papadopoulos 32:11
Absolutely. So, again, there are many things we need to touch upon here. I’m going to explain the TileDB evolution, so that you can see what we have built and what we’re building down the road, and where we are. But things become even more complicated than what I’m explaining right now, because I focused on the components of a database system. However, this notion of a database system needs to be revisited, because it captures just a very small piece of the puzzle in the data infrastructure of an organization. Think about this: in the past, you used to care mainly about tables in most organizations. You used to buy Oracle, IBM, or Microsoft, and that was your data infrastructure. It was just a single, colossal database system. There were no data engineers or data scientists; there were DBAs, the administrators, and you were set. You would pay a lot of money every year, but you would be set. Today, the data infrastructure consists of hundreds of pieces of the puzzle, right? You have AI also. It’s part of the world, and it deals with data as well; it’s just a different computation. It’s not SQL. But again, in the context of the universality that we’re talking about here, it’s just some other operators and different formats and different data. So now you also need something that gives you more than just a SQL console. You see Jupyter notebooks, where you’re doing your data science. You’re building pipelines for processing the data, as a bit more advanced ETL. You see dashboards, transformations; you see so many different things. And, at least in my mind, a lot of those components need to be built into the database system. I think, again, this is equally radical to the ideas I had in the past about universality and having a single database for everything, but the database system needs to evolve. Otherwise, we’re creating way too many silos.
One silo for the dashboards, one silo for the ETL, one silo for AI. And this also calls for building unnecessarily large data engineering teams. Data engineers are extremely important, but you’re unnecessarily inflating your data engineering team. And what will happen is that, again, you end up reinventing the database system, but now it’s not just on tables. Now it’s on your database system and on your ETL pipelines and on your AI and on your data flows. In the end, you’re ending up reinventing the database completely. Because, again, you need access control. Again, you need authentication. Again, you need a catalog to see what’s happening across your organization. Again, you need wrangling; again, ETL. These are problems we’ve solved in databases. So anyway, just a small note before I tell you where we’re going, because you’re going to see some of those components in TileDB today. TileDB has evolved; even within its own evolution, it has evolved even more to capture this kind of broader scope. So let me tell you how it started and how it evolved. Because, in my opinion, that’s the only sane way to build something like TileDB, especially the way we started, right? Nobody gave us a blank check for tens of millions of dollars to build it. We built it very organically, starting with a single person at the time, myself, and then raising capital, incrementally proving this crazy vision, attracting more capital, and attracting more great talent to build this, right, because I’m building this team. So, the very first sane decision that we took, and I’m telling you about the decisions we have not regretted, okay? I probably want to hide the decisions that we regretted, although I am forgetful at times; after I learn my lesson, I move forward. So the first and only thing we started with was storage. I focused on that in the first 18 months of building the code.
So the first couple of years at MIT were experimenting, and then for the first 18 months as a company we focused exclusively on the storage engine, and built what we believe is perhaps the best format in the space. We don’t talk too much about it, for my own reasons: I don’t want to promote a format, I want to promote an engine. The engine matters, not just the format. I don’t want to create yet another format consortium and evangelize it. I want to tell you, here’s the library, forget about the format; the format is always going to change. Here’s the engine, here’s the API. We’re going to give you stable APIs; just use TileDB. So we focused on the storage, and we built something extremely powerful that has features that are necessary across domains: cloud-native data layouts on object stores with, you know, file immutability, so that we don’t have too much copying of files and objects; versioning; time traveling; amazing indexes; ACID guarantees; insertions and deletes. And for everything that we do there, it took a long time to get it right. But we did get it right. This gave us amazing performance for very different use cases, again, like genomics, images, and others. Because once you get those right, the tabular use cases become easier. Yeah, tables are very neat. They’re easier to capture once you start operating at petabyte scale, and once you get the indexes right and you optimize at the I/O request level and the CPU cycle level, the rest becomes much easier. So, the storage engine. We got our first five customers from the storage aspect alone, on the open source, and they are customers until today; you know, they trust us. They really enjoyed the storage engine because it solved a lot of the genomics problems, for example. So yes, that library alone gave us a lot.
As we were proving this out, we were getting some customers and attracting more capital, so we were more confident to start building the other layers on top. The next thing we did was build an inordinate number of APIs and integrations. Right now, the APIs are all fully maintained and fully optimized by us. And we started offering, for example, SQL queries through Presto, through MariaDB, through Spark. So that became a quick win. We plugged into those systems and we said, here, if you want to do some SQL queries at scale, this is what you should use. And that was an easier path than creating all those layers I explained, right: the SQL parser, the query rewriting, the optimizer, the executor, all of those aspects of the system. So that was next. And then we started realizing that some of the most important things are access control, logging, the governance aspect. And also, if we wanted to build our own execution engine, as we are doing today, we needed to start with fundamentals. So we have built our own serverless engine. Since day one, our engine has been serverless. And we were building it out because we knew that, at the end of the day, no matter what query you have, it’s going to be decomposed into either an operator tree or a task graph. And that task graph needs to be deployed in a distributed setting where each task is dispatched to a different worker. Those workers need to be elastically scaled, they need to have retries, they need to log everything, and we need to monitor everything. So we built the primitives. We didn’t build a SQL operator without the primitives. And that helped us solve a lot of problems, again, in genomics, in the imaging area, and so on and so forth, again, slowly and gradually. Then we started building dashboards and Jupyter environments, and queries that become task graphs more automatically.
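Editor's note: the idea of decomposing work into a task graph whose tasks are dispatched to workers, retried, and logged can be sketched in a few lines. This is a toy illustration only, not TileDB's engine. The names `run_task_graph` and `_run_with_retries`, the thread pool standing in for distributed workers, and the simulated failure are all assumptions made for the example.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

log = logging.getLogger("toy_task_graph")


def _run_with_retries(name, fn, inputs, max_retries):
    """Run one task, retrying on failure and logging every attempt."""
    for attempt in range(1, max_retries + 2):
        try:
            log.info("task=%s attempt=%d", name, attempt)
            return fn(inputs)
        except Exception as exc:
            log.warning("task=%s failed: %s", name, exc)
            if attempt > max_retries:
                raise


def run_task_graph(tasks, deps, max_workers=4, max_retries=2):
    """Execute tasks in dependency order.

    tasks: name -> callable taking a dict of its dependencies' results
    deps:  name -> list of task names that must finish first
    A local thread pool stands in for remote, elastically scaled workers.
    """
    results = {}
    pending = set(tasks)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending:
            # A task is ready once all of its dependencies have results.
            ready = sorted(t for t in pending
                           if all(d in results for d in deps.get(t, ())))
            if not ready:
                raise ValueError("cycle or unknown dependency in task graph")
            futures = {
                t: pool.submit(_run_with_retries, t, tasks[t],
                               {d: results[d] for d in deps.get(t, ())},
                               max_retries)
                for t in ready
            }
            for t, fut in futures.items():
                results[t] = fut.result()
            pending.difference_update(ready)
    return results


# Hypothetical three-task graph: two scans feed one aggregation.
attempts = {"scan_b": 0}


def scan_a(_inputs):
    return [1, 2, 3]


def scan_b(_inputs):
    attempts["scan_b"] += 1
    if attempts["scan_b"] == 1:
        raise RuntimeError("simulated transient worker failure")
    return [4, 5, 6]


def total(inputs):
    return sum(inputs["scan_a"]) + sum(inputs["scan_b"])


tasks = {"scan_a": scan_a, "scan_b": scan_b, "total": total}
deps = {"total": ["scan_a", "scan_b"]}
results = run_task_graph(tasks, deps)
```

Here `scan_b` fails once and succeeds on its retry, so the aggregation still completes. A real system would add the elastic scaling and monitoring that Stavros mentions, which this sketch leaves out.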
Then ETL processes and so on. All of those were built on the primitives, so the architecture was built in a very sane way from the get-go. And that’s why we have very little technical debt in that respect. We reuse everything; we may refactor, but we don’t start over. And nowadays, we’re pushing more and more compute down to TileDB. Because, first of all, we want it to be self-contained. Second, the compute moves closer to the data, so we minimize the copies, right? Everything can be zero-copy. Again, we optimize for memory, we can cache; there are so many optimizations we can do if we manage the data from the time it comes into main memory to producing an answer. It’s much easier to control performance and to optimize for performance. So this is what it is. There is a lot of work still to be done around optimized distributed computing and more primitive pushdowns, but still, TileDB today is something you can use and get immense value from for very challenging problems. You can use SQL, do distributed computations, use dashboards, use Jupyter notebooks, and, most importantly, federate your data. For me, this is one of the killer features.

Kostas Pardalis 42:45
So from what I hear, TileDB started from the storage and, as we said, solved first of all the fundamental problem of how we are going to write the bytes on storage, and on top of that built the rest of the DBMS, right? All the rest of the stuff that is needed to actually execute queries, give access control, and all these things. So, staying with the low-level stuff a little bit, on the first step, which is the storage engine: one of the most well-known storage engines out there is RocksDB, right, which is

43:31
a key value store. Now, with key value stores,

Kostas Pardalis 43:39
Let’s say the mental model of how you interact with the data is super straightforward. You have a key and a value, right? The API is quite primitive. So what’s the difference? If I have a system like RocksDB, right, I request a key and get a value back, or I give a key and a value and it gets stored, right? So if I compare it with TileDB as a storage engine, what’s the difference? How do I access the data in TileDB? And is there, like, an API

44:11
difference?

Kostas Pardalis 44:13
Tell us a bit more about the experience that I would have if I wanted to work with TileDB today, compared to something that I already know, like RocksDB, right? Oh, yeah.

Stavros Papadopoulos 44:21
Yeah, it’s kind of day and night, and I’m really going to elaborate. So okay, RocksDB is a key-value store. You can model, again, lots of stuff as key-value pairs. And this is very good for lookups: I’m looking up with an equality query, right, and I get back the blob. That is not really the use case that we’re after. I’m going to explain a bit. I’m going to expand the baselines a bit and add, for example, Parquet, and all the variations on top of it, like Delta Lake, Iceberg, and others. Parquet is a specification, and of course, through Arrow, for example, you get the engine. And this is where some people may be confused, because it’s a specification. Yes, it’s a format, and you may have multiple different implementations of it. There are always several of them, and not necessarily one best one. So let’s focus on Arrow. Arrow implements the Parquet format, and others implement the Parquet format too, I think Presto as well. So Parquet is for tables, RocksDB is for key-values, and those are the most predominant ones. What we use is multi-dimensional arrays. And this is a completely different level. First of all, it’s more sophisticated. It requires much more work and thought to be put into it. But in order to understand why, I’m going to tell you the following. Think of an array as a shapeshifter. Depending on your problem, it’s going to shapeshift into a two-dimensional array, into a three-dimensional array, into a dense array where every cell has a value, or into a sparse array where the majority of the cells are empty, right, and they should not be stored, because the space itself might be infinite. So there are different semantics there. And think of the dimensions, the axes, as distinct from the values: the dimensions are an index, a very powerful index. And that index allows you to do two things. The first is, again, done in a shapeshifting manner.
So depending on your application, you can lay out the bytes in the file in the way that benefits your queries. This is important. Performance is dictated by the proximity of the result bytes to each other in the file. If your result appears contiguous in the byte space of a file, there are so many ways to bring it into main memory extremely fast. Yeah. But if you’re asking for a million bytes scattered in a big file, you may end up, in the worst case, doing one million requests. Yes. And the latency of each request is going to kill you. Very naively speaking, arrays allow you to retain the proximity of the bytes on disk with respect to your query workloads. And in most use cases, you know your workloads. Not 100%, but you know that, you know, my queries are spherical, or my queries are elongated along one axis. Trust me, in the use cases we’re tackling, you know your queries, more or less. I’m not saying that you’re going to hard-code them; I’m just saying that you know the patterns, not the actual queries. Yep. So that’s one of the things that arrays do, and you cannot do that with a key-value store, because key-value stores are hashing the values. So if you’re asking for a contiguous range of values, those are not going to be contiguous in a key-value store; they’re going to be in random places. Now, with arrays, you get contiguous reads. You can retain the spatial locality when you map the multi-dimensional space to the single-dimensional space. The same is more or less true for Parquet. I mean, okay, you can hack it a little bit: you can partition, you can change the order, you can do some of these things, but you have to hack it. With arrays there is nothing to hack; it’s infused into the array model. You don’t need to think about partitions, you don’t need to think about the orders. TileDB does that for you.
So this is the first, the layout of the bytes. The second is the indexing. In the dense case, this is very different from Parquet. For dense arrays, you don’t need an extra index. Everything is inferred: the positions of the bytes on disk are inferred with very simple arithmetic. Then, instead of checking conditions, does this cell satisfy my query, does that cell satisfy my query, you know a priori the byte ranges that satisfy your query collectively. You minimize the I/O, and you minimize the memcpy requests in memory, where you’re copying the data from your temporary buffers into your result buffers. These are classic boosts in terms of performance, but this is what arrays do. And that’s why they’re super, super powerful. You just need to reason in terms of arrays, not in terms of tables, not in terms of key-values.
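Editor's note: the two points above, that a dense array needs no extra index because byte positions follow from arithmetic, and that layout decides how many separate reads a query needs, can be made concrete with a small sketch. It assumes a plain row-major layout with fixed-size cells and no tiling or compression. That is an assumption chosen for illustration, not a description of TileDB's on-disk format, and the function names are hypothetical.

```python
def cell_offset(row, col, n_cols, cell_size):
    """Byte offset of cell (row, col) in a row-major dense 2D array."""
    return (row * n_cols + col) * cell_size


def byte_ranges_for_slice(rows, cols, n_cols, cell_size):
    """Byte ranges covering a rectangular slice.

    rows, cols: half-open (start, stop) pairs.
    Returns a list of (offset, length) pairs, merging ranges that are
    adjacent in the file so each pair is one contiguous read.
    """
    ranges = []
    row_length = (cols[1] - cols[0]) * cell_size
    for r in range(rows[0], rows[1]):
        start = cell_offset(r, cols[0], n_cols, cell_size)
        if ranges and ranges[-1][0] + ranges[-1][1] == start:
            # This row starts where the previous range ended: extend it.
            ranges[-1] = (ranges[-1][0], ranges[-1][1] + row_length)
        else:
            ranges.append((start, row_length))
    return ranges


N_COLS, CELL_SIZE = 1000, 8  # 1000 x 1000 array of 8-byte cells

# Two full rows: contiguous in a row-major file, so one read suffices.
row_query = byte_ranges_for_slice((10, 12), (0, 1000), N_COLS, CELL_SIZE)

# One full column: every cell is 8000 bytes from the next, so 1000 reads.
col_query = byte_ranges_for_slice((0, 1000), (5, 6), N_COLS, CELL_SIZE)

print(len(row_query), len(col_query))
```

No cell is inspected to answer either query; the byte ranges come from arithmetic alone. The column query shows the cost of a layout that does not match the workload, which is why being able to choose the layout per array matters.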

Kostas Pardalis 49:55
All right. So to summarize: with key-value stores, we are storing a key and a value, right? We have lookups, and issues with locality when we want ranges and all these things. We have columnar stores, where we have the whole column. We can hack it, as you said: sort, partition the data. If we want to get the whole column, yeah, there’s going to be some locality there. But beyond that, how do you do point queries there, inserts, and stuff like that? It gets too complicated. And then we have arrays, right? And I get what you’re saying about how, in the structure, you can also incorporate indexing, and infer the indexing, and all these things. My question is, let’s think about it from purely an API perspective, right? I’m a developer. When I have a key-value store, I know that I’m going to make a request, I’ll ask for a key, and I’ll get back a value. In a columnar store, I ask for a column, right, and I start iterating on the column, the values one after the other. So when I’m dealing with arrays, and as you said, there are different types of arrays, not only one-dimensional or two-dimensional, what am I interacting with as a developer? What does the API look like when I’m using TileDB in my application?

Stavros Papadopoulos 51:30
Yeah, this is a great question. Let’s again differentiate between the format and the API. The format, the technology, is what I explained. By the way, TileDB is also columnar. You need to think of TileDB as Parquet on steroids. It’s very similar in some respects, but it builds in stuff that you need to hack with Parquet. To get the partitioning, for example, you need to hack on top of Parquet to partition; you need to hack Parquet to support versioning. You don’t need to do that with TileDB; it’s embedded into the format itself. So what I’m saying is, you should think of it as a generalization of Parquet. Now, the API is as follows. First of all, you can just run SQL queries on TileDB. Imagine an array has dimensions, and you may think of a dimension as a column, and think of the attributes as other columns. Think of the array as a data frame that is indexed, where the dimensions are special columns. And in the dense case, the dimensions are non-materialized columns; they’re virtual. That’s how you should be thinking about it. Then SQL queries are valid. Think of them as different columns, differently compressed, very similar to what Parquet does, okay? Now, SQL is one of the APIs. In Python and R, we have more Pythonic and more R-like APIs as well. For example, you can use NumPy-like APIs for TileDB: you use the bracket operator, and you’re slicing. That’s all you’re doing. It’s just that TileDB also supports real dimensions, not only integer ones, and string dimensions. So in your bracket operators, you can also slice string ranges, okay, and you can slice real ranges. The API is like interfacing with a NumPy array. And on top of this, we also built a pandas-like API, where we have a data frame operator which is very similar to pandas.
And on top of everything, you can add conditions on the non-dimension attributes, and there are a lot of tricks we do to push those down. This is very similar to Arrow: you create a query expression and push it down to TileDB, very similar to what you would do with Arrow or pandas or similar data frame libraries. Okay, so it’s easier than you might think. You don’t need to think about the arrays, other than when you model your data, so you get the performance. Once you do this, and in a lot of applications we can just do that for you so you don’t even have to think about it, from that point onwards you use what is familiar to you: the Pythonic API, the R API, SQL. It’s awesome.
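Editor's note: to give a feel for bracket-operator slicing over string and real-valued dimensions, plus a condition on a non-dimension attribute, here is a self-contained toy. It deliberately does not use the TileDB Python package. The class `ToySparseArray`, its `query` method, and the choice of inclusive range bounds are assumptions made for this sketch, so TileDB's own documentation is the reference for the real API.

```python
from bisect import bisect_left, bisect_right


class ToySparseArray:
    """A 1D sparse array: one dimension (int, float, or str) and one attribute.

    Cells are kept sorted by coordinate, so a range slice is a binary
    search followed by one contiguous scan.
    """

    def __init__(self, cells):
        # cells: iterable of (coordinate, attribute_value) pairs
        self._cells = sorted(cells, key=lambda cell: cell[0])
        self._coords = [cell[0] for cell in self._cells]

    def __getitem__(self, key):
        """Slice by coordinate range; both bounds are inclusive here."""
        if not isinstance(key, slice):
            key = slice(key, key)
        lo = 0 if key.start is None else bisect_left(self._coords, key.start)
        hi = (len(self._coords) if key.stop is None
              else bisect_right(self._coords, key.stop))
        return self._cells[lo:hi]

    def query(self, key, cond=None):
        """Slice on the dimension, then filter on the attribute value."""
        return [cell for cell in self[key] if cond is None or cond(cell[1])]


# String dimension: slice a lexicographic range of labels.
by_name = ToySparseArray(
    [("delta", 40), ("alpha", 10), ("charlie", 30), ("bravo", 20)]
)
name_slice = by_name["b":"d"]

# Real-valued dimension: slice a range of floating-point coordinates.
by_pos = ToySparseArray([(2.75, "c"), (0.5, "a"), (1.25, "b")])
pos_slice = by_pos[1.0:3.0]

# Condition on the attribute, applied to the sliced cells only.
filtered = by_name.query(slice("b", "d"), cond=lambda value: value > 20)

print(name_slice, pos_slice, filtered)
```

The toy filters after slicing, in memory. The pushdown Stavros describes would evaluate the condition inside the storage engine so that non-matching cells are never returned to the caller.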

Kostas Pardalis 54:27
That’s great. That’s great. Yeah, the reason why I’m asking is because people are familiar with certain things. And when you introduce something new, I think if you can create some guide that follows what people already know, it’s really helpful for them to understand what they’re dealing with. And at the end, we are talking about products or technologies, or whatever you want to call them, that are primarily consumed by technical people. Engineers want to understand. Extracting knowledge from the process of using something is an important part of the job. And many people, including me, like doing that, right? So that’s what I was trying to get out of this conversation. I know we spent a lot of time on, let’s say, the low-level stuff. But before I give it to Eric, because we’re also getting quite close to the end of the recording, I want to ask something. You keep mentioning, from the beginning, genomics as a very important use case. And I would like to hear from you what makes genomics such a unique use case when it comes to working with data. So can you elaborate a little bit more on that?

Stavros Papadopoulos 55:58
Yeah, absolutely. For genomics, there are technically two different kinds of sub-verticals. There is population genomics, which is DNA; think about it like this. And then there is single-cell, which is mostly RNA. I’m oversimplifying, okay, but these are two different areas, quite dissimilar, I would say, on the surface. From a technology standpoint, it’s identical. It’s all arrays. So I am going to explain this. And although I’m going to talk about genomics, the same ideas apply in geospatial, because geospatial is another vertical for us.

56:37
And just for the record, let’s not forget, we also do a lot of tabular use cases and a lot of time series. Those are a little bit easier for us. The most significant ones are those that come from a very specific scientific vertical, you know, life sciences or geospatial. Those are more scientific use cases than the typical business analytics that use tables. So here is what makes them very appealing to us.

Stavros Papadopoulos 57:12
And again, geospatial is very similar, so I’m going to focus on genomics. The very first reason why I started working on this was because those are meaningful use cases, right? Can we help hospitals save babies’ lives? It’s as simple as this. It creates a lot of purpose in our company, right? We’re solving super difficult problems. I know it might be a bit cliché, but it is the truth. The reason why it is appealing to us, and why it has been extremely difficult for other databases to break into this space, is that these spaces seem very convoluted to database folks like me. You really need to invest the time to understand the science. If you don’t understand what the scientists you’re dealing with are saying, there is absolutely no way that you can solve their problem. It’s impossible. They use a lot of jargon, and you need to understand this jargon. And then you need to dig into those very convoluted formats. The formats look crazy. They’re not crazy; they look crazy because they come in text. Again, there’s a lot of jargon, multiple fields, seemingly variable-length fields. The data is crazy. So you really need to love the space. Yeah, and you need to hire people that are experts, which is exactly what I did, right? I mean, I dove very deep into those domains, but I’m not a bioinformatician. I’m not going to gather all this knowledge within the past couple of years. You bring in the experts that understand it deeply. So

59:00
TileDB is an amazing fit for those, because those use cases are not purely tabular. There are always tables, always. This is very appealing for a system like ours, because we can definitely handle tables well, but then we can handle matrices equally well, right, or even better. And those applications come with a lot of matrices, either dense or sparse. And if you don’t have native array functionality, you really need to go out of your way. With relational database systems, someone might say, yeah, I can do it. You might, but you will never get to the performance we’re getting. I’m not going to go through all the reasons why this is impossible, but even if you do it, it’s not going to be worth it. You’re going out of your way to do it. For us, it’s built in. For others, you need to go out of your way to do this. So that’s why we started with those. Again, by no means are we focusing exclusively on those, but they’re very good verticals. We’re working with super smart people, and we work very closely with them. We are solving something that has not been solved before, so we’re making our mark in those spaces. And from a business perspective, they’re lucrative: we’re working with very big pharmas and hospitals. This is a very, very good space. So it can give us the growth we need in order to accelerate and then expand to the other verticals. We can get to the other verticals more easily; we started with the more challenging one. This is great. I mean, we need to have at least one more episode, because, no question, we just scratched the surface here.

Eric Dodds 1:00:53
Brooks actually isn’t here, and I’m the one recording. So I’m tempted to go long, but, you know, we don’t want to get punished by Brooks too much, because he runs a pretty tight ship. Okay, so we’re actually a little bit over time, which I love, because Brooks isn’t here. But I’d love to conclude with just a really practical anecdote. You mentioned earlier, and actually, I think this was when we were discussing the show before we started recording, babies in the ICU. That’s an actual use case. We’ve talked in such technical depth, which I love, but can you bring it home for us and talk about what’s happening in people’s lives, you know, people who have babies in the ICU, and how TileDB plays a part? I mean, I don’t want to get overly sentimental, but that’s a big deal. I’d just love to hear: do you have a story about what this looks like on the ground for the people who are sort of the ultimate end consumer of the data?

Stavros Papadopoulos 1:02:08
Yeah, absolutely. Before I tell you exactly what we do there, I want to be clear: the people who are actually saving babies’ lives are the doctors we’re talking to, and they’re super pioneers, especially Rady Children’s and Dr. Kingsmore and his team, and a lot of other partners that he is accumulating there. They’re the absolute pioneers because, of course, they know the science, once again, but they were so perceptive as to understand that their science is blocked by data management. In their case, the science is clearly blocked by data management; the data is too big. And again, I’m going to oversimplify, and if somebody on the genetics side is listening in, I apologize beforehand. I do that on purpose; I just don’t want to add anything to the jargon. But the idea is, it’s quite critical, sometimes, to genetically test a newborn baby when they’re born to find specific genetic diseases that pretty much can destroy their lives if they go untreated early on. There are specific genetic diseases that are treatable, but you need to take prompt action. In order to be able to treat those diseases, the very first thing that you need to do is identify that there is a risk for such a disease, and you do that through DNA sequencing. Now, in order to be able to identify whether a baby is prone to or will have this genetic disease, you need to find the corresponding mutations in the baby’s DNA sequence. And here is where data management comes in. Someone may say, okay, just sequence the baby, find those locations, and say, okay, this variant, this mutation, is going to lead to something critical. That’s fine, if you know which

Stavros Papadopoulos 1:04:26
variant is responsible for the genetic disease. How do you know that these variants are responsible for the genetic disease? You need to have a very big

Eric Dodds 1:04:38
thing, right? Yeah, because it’s not like a binary. Is it a binary?

Stavros Papadopoulos 1:04:43
I mean, it’s statistics.

Eric Dodds 1:04:46
Right? Like it’s not a binary like this chromosome

Stavros Papadopoulos 1:04:48
repository. Exactly. You have a repository, say, a database table, which says these particular mutations are pathogenic, they will lead to something bad. But how did you know that right off the bat? How did you create this factual table that says, yes, this disease is going to happen because of this variant? It’s because you analyzed a million other babies, and the data from these million other babies is huge. So you need a database system to be able to do analysis at that scale, in order to always keep your table up to date, the one that says this variant is going to lead to something, so that we can make decisions in the ICU. Yeah, that’s how we contribute to the space. Once again, all the credit goes to those pioneers, because all of these technologies are new, and this is truly the first time that genomics plays a big role in clinical use. Up until now it has been used mostly in research. But now we’re talking about clinical use. Yeah. And that’s why I really do respect the people that we’re working with.

Eric Dodds 1:06:01
Amazing. Absolutely amazing, truly inspiring. I’d love to meet some of those doctors; maybe we can have one of them on the show, that would be kind of fun. But thank you so much for giving us your time. This has been such a wonderful journey of, you know, understanding academia, understanding entrepreneurship, understanding the deep guts of databases, and then ultimately understanding how the ultimate manifestation of this can truly change lives, which is pretty incredible. So, Stavros, thank you. It’s been incredible. And we’d love to have you back to continue the conversation. Absolutely, thank

Stavros Papadopoulos 1:06:49
you so much for having me, anytime.

Eric Dodds 1:06:51
What a fascinating guy. I think my big takeaway, Kostas, is that you don’t often hear about ideas that arise sort of from pure curiosity. Maybe that’s an overstatement, because Stavros is obviously working on real problems, but he also had a genuine curiosity to understand the relationship between storage and how that impacts the functionality of all these other components of a database system, right, and the way that you query it, and all the things that you can do with it. And I just really loved hearing about his curiosity leading him to some interesting questions that ultimately led to interesting discoveries. Because a lot of times, the classic entrepreneur’s story is, you know, I was sick of late fees at Blockbuster and so I started a mail-order DVD company, right, and then it became Netflix, right, and you’re responding to some sort of pain you or someone else experiences. And so I love that, you know, it grew out of curiosity.

Kostas Pardalis 1:08:06
Yeah, absolutely. I think this is something that is commonly found in people who have had a career in research in general. To be honest, to go through graduate studies and PhDs and postdocs and all that stuff, you have to be a curious person, or rather, curiosity has to be important enough for you that you can keep going through the academic way of doing things. And you can also tell from the energy that the guy has, right, he gets passionate. So I think that comes together with the career he had before that. I mean, he’s definitely great. We can see that, right? It was really fun for me to be part of this conversation with him. And for me, it’s also very interesting to see how TileDB is going to mature and progress as a product. There are a lot of things that the team is building on top of TileDB as the storage engine, as most people might know it. So I’m looking forward to what the future is going to look like for them. And I have a feeling we will have him on the show again in a couple of months, and he will have news to share with us. So I’m looking forward to that.

Eric Dodds 1:09:48
I agree. Well, thanks for tuning in. Subscribe if you haven’t, and we will catch you on the next one. We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.
