Episode 55:

Tables vs. Streams and Defining Real-Time with Pete Goddard of Deephaven Data Labs

September 29, 2021

On this week’s episode of The Data Stack Show, Eric and Kostas are joined by Pete Goddard, founding partner and CEO at Deephaven Data Labs. Deephaven’s query engine utilizes real-time data and creates a framework to make people productive with that engine.

Notes:

Highlights from this week’s conversation include:

Pete’s background in data engineering and capital market trading (2:10)
Comparison of the tooling from 2012 when Deephaven started with that of today (10:30)
Taking a closer look at defining real-time data (19:47)
Getting non-technical people, clients, and developers all on the same platform (36:11)
Deephaven’s incremental update model (40:25)
Kafka, timely data flow, and Deephaven (44:22)
Use cases for Deephaven (51:52)
Going to GitHub to try out Deephaven (1:02:43)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 00:06

Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.

Eric Dodds 00:26

Welcome back to the show. Today, we’re going to talk with Pete Goddard of Deephaven. Deephaven is a really cool data technology: real-time streaming. It’s just really cool. I can’t wait to dig into the technical details.

But my burning question, I’m really interested. So Deephaven was built by a group of really smart people who came out of the finance industry. And I’m just really interested to hear about that. I mean, based on some of my past experience, people from finance tend to be really good at technology, if they’re passionate about it in terms of software development and building things. And things get pretty complex in finance. So I’m just really curious to hear the backstory. How about you, Kostas?

Kostas Pardalis 01:13

Yeah, absolutely, Eric. That’s one of the things that I definitely want to hear about. I think it’s also like the first time that we have someone who has experience in the finance sector. So it’s going to be very interesting to learn a bit more about how technology looks in this sector. And so yeah, I’m really interested to learn more about these technical details, because I know that the guys in this sector are very obsessed with performance, and optimizations. So I’m looking forward to seeing the special sauce at Deephaven.

Eric Dodds 01:47

Awesome. Well, let’s jump in and talk with Pete.

Kostas Pardalis 01:49

Let’s do it.

Eric Dodds 01:51

Pete, welcome to the show. We’re incredibly excited to learn about you and about Deephaven, and all sorts of awesome data stuff.

Pete Goddard 02:00

Hi, guys, thanks for having me. This is a real treat.

Eric Dodds 02:05

Cool. Well, give us just a brief background on who you are and where you came from.

Pete Goddard 02:10

Sure. I’m an engineer from the University of Illinois. I bounced around a little bit in the Midwest and London. I went right from school, believe it or not, to a capital markets trading company in Chicago. This was at a time when it was a little bit unusual to do something like that. I traded derivatives for a bit. And since 2000, I’ve more or less lived at the intersection of trading and quantitative trading, algorithmic trading, and system development. So think of risk management teams and quantitative teams and computer science teams all working together; that’s really where I’ve lived. And then I founded a hedge fund that was based on that around 2004, 2005, and spun a piece of technology out of there in late 2016, with some founding engineers, and that’s what I’m here to talk to you about today. It’s called Deephaven.

Eric Dodds 03:03

Very cool. And I want to hear, I want to hear about the origin story, because I just loved reading about that before the show. But before we go there, what does Deephaven do?

Pete Goddard 03:15

Deephaven is two things. It’s a query engine that’s built from the ground up to be first class with real-time data. So real-time data both on its own, as well as in conjunction with static data, or historical data, things that one might consider batch data. So that’s the first thing, it’s really that data engine. But then Deephaven also is a series of integrations and experiences that come together to create a framework that makes people productive with that engine. These things are important because as you move real-time data around, as you move updating data around, the infrastructure to support that in the world is not as robust as with static data. So we had to build a lot of that tooling ourselves.

Eric Dodds 04:07

Very cool. And just for the listeners, when we talk about, and we can get into the product more deeply, but when we talk about real-time data, updating data, moving data around, one thing that’s interesting, as we think about all the different contexts, tooling, etc., on the show, is that a lot of tools will focus on a particular type of data. Is that true for Deephaven? Or are you agnostic, and you’ll move any type of data? Is there a particular type of data that Deephaven excels at?

Pete Goddard 04:40

Sure. So we came from a capital markets background, as I suggested; that’s sort of where I lived when we built this tool, originally for ourselves, though certainly it has evolved a lot. It was always developed to be, and I think today it certainly stands as, mostly agnostic about data and use cases. It can certainly support unstructured data. But it really excels at either semi-structured or structured data that is either updating or historical and static. So the easy way to think about data in regards to that is: where does it live, and how can you get at it? Certainly, you can get data into memory via all sorts of methods, and we can talk about that. But there’s known technologies and known formats for data. And for us, that means Parquet certainly is important, and all sorts of files, as you can imagine. And then for the real-time stuff, we’re in the world of Kafka and Redpanda and Solace and Chronicle queues and things of this nature. So when we think of data, we’re thinking about those formats, and how do we interact with them? Not necessarily, what are the specific use cases that business users and their developer and quantitative partners are delivering on top of it?

Eric Dodds 06:05

Yeah, absolutely. Well, I want to dig into the technical stuff. I know Kostas has tons of questions. But really quickly, I’d love a quick story on the background. So reading about your story, and the birth of Deephaven, it was interesting to me. I used to work in the education space, at a company that taught software development. And a trend that we noticed was that people who came from quantitative finance really made incredible software developers, and maybe in another show we can discuss the reasons behind that, because you probably have way more insight into that than I do. But when I read the origin story and looked at the platform, I thought, oh, that makes total sense, actually. Like, yeah, of course, these are the people who could build extremely high-scale real-time systems, or infrastructure to support high-scale real-time, and all the use cases surrounding that. So do you just want to give us a brief background on how you went from being a hedge fund to, you know, being a software entrepreneur?

Pete Goddard 07:18

Sure. Thanks so much. It is interesting, and I think in regards to the people that you cited, or the type of people that you cited, I think I’ve just been very lucky, frankly, in that I’ve only worked in that space. And I think I’ve been blessed to have very high quality engineers and really good people to work with. So I certainly echo your sentiments on that one. In regards to the origin story, chronologically, it goes something like this. In 2004-2005, I was approached by a couple of friends of mine (to say I respect them is probably an understatement; I admired them) to start a trading company with them: a quantitative trading company, a market maker, actually, at its core. We spent five or six years really just in more or less high frequency trading, most of it in the derivative space, the option space, as opposed to stock or futures or something like that. In addressing that problem set, invariably, you have to build a team that is very good at a number of things: certainly quantitative modeling, some sort of machine learning or, you know, predictive sciences, let’s say, as well as just system building, both from a fast software perspective, as well as from a good technology, think hardware and networking, perspective. So we assembled a team and were really in that space for six years or so.

Pete Goddard 08:45

It’s an interesting industry to be in because it’s a bit of an arms race where you think you’ve built something that’s kind of interesting, and has an advantage, but you know that your advantage won’t last very long. So in 2012, the partners and I decided that we should diversify, that we wanted to get into more scalable strategies, we wanted to move towards different horizons of prediction, and different business evolutions, let’s say. And so we looked at what we would need for that business, just like anyone else would: I’m going to diversify, so what do we have to do to do that? Well, what are our advantages? How are we going to pursue this? And one of the keys that we thought about is data infrastructure. So really, that was the point that led us to the development of Deephaven. Right around 2012 is when it was first seeded, let’s say, as an idea. I can, if you’d like, tell you the criteria that we used at that time to define a system that we would want, and maybe we could discuss how you think about those criteria, what your reaction to those criteria might be, both in 2012 and maybe even today.

Eric Dodds 10:03

Yeah, I’d love that. Because I mean, 2012 is, I love that that was the year where the idea was seeded, because you have the birth of Redshift, I mean the very early innings for like the modern cloud data warehouse. And then, if you fast forward to modern day, the available data tooling is pretty significant. And so I’d love that comparison of what did you define then? And then, what does it look like now?

Pete Goddard 10:30

Yeah, and I feel so lucky to have the seat I’m in and to be able to witness this journey of the industry, both as an insider, or at least a wannabe insider, as well as somebody who’s just learning all the time. And as you’re all aware, this space is moving very quickly.

Pete Goddard 10:57

At the time, I’m CEO of a hedge fund, right, a quantitative hedge fund. We say, okay, we’re going to do this business, we need a data system. And we really established what we thought were some pretty basic criteria. We said, okay, we want this data system to be a central source of truth. We want it to be one system, and if it’s gonna be one system, we literally want every single person in the company to coalesce around the system. You can imagine we had lots of different people; very, very few of them would consider themselves DBAs. Most of them would say, oh, I’m a developer or I’m a systems person, I’m a trader, I’m a portfolio manager, I’m a compliance person, I’m in accounting. We said, well, it’s gonna be data, we want all of these people around it. That was first and foremost. The second criterion was that we wanted this for all use cases that weren’t high frequency. So high frequency in the capital markets meant, at the time, let’s just say it was certainly sub-millisecond. Right now, it’s many orders of magnitude faster than that. But at that time, we wanted to support everything that’s not high frequency. So 10 milliseconds and higher. So 10 milliseconds to the last century, any data that’s within that span, we want this one system to be able to handle it. That was the second criterion. And the third is just simply we wanted it to be fast and to perform well. And we wanted that to be the case for basic use cases, like, I don’t know, looking up P&L or understanding position, as well as sophisticated use cases like using Python or Java to build a predictive model or do other neural network stuff. So those were the basic criteria we had. And so I can tell you what my reaction was then and what my reaction is now, but maybe you, Kostas, could volunteer: what do you think? We wanted a system that lots of people could use, that was good with both real-time and historical data, and that was pretty high performance. How could you have assembled a system like that in 2012? Kostas, got any ideas? You’re the man on this stuff.

Kostas Pardalis 13:12

Oh, yeah, I can look at this. Eric, do you think you want to give it a try, or do you want me to?

Eric Dodds 13:18

No, I’m going to let you handle this. Because in 2012, you were working with whatever available data technology was there, while I was still… I don’t know if I was quite that deep into the stack at that point. So take it away. You can speak for both of us.

Kostas Pardalis 13:36

Yeah, yeah, absolutely.

Eric Dodds 13:38

I will say one thing though. It is really interesting in 2021 to think about this; one phrase caught me. Well, I mean, super interesting criteria, but saying we want everyone to have access to the data, that sounds almost cliche in 2021. Just because, I mean, it’s been abused in marketing speak for a very long time now. But it’s easy for us to say, oh, yeah, I mean, with the data tooling available today, yeah, that’s not uncommon. But back in 2012, it really was. Like, that’s a pretty technical requirement, right? Not just necessarily a cultural one. So anyways, that’ll be my contribution.

Kostas Pardalis 14:27

Yeah, my contribution. Like I’m trying to time travel back to 2012 and how I would architect a system like that. My first approach, at least, would be something like what was called, and I think is still called, a lambda architecture, right? Yeah. You pretty much have a combination of a streaming system with a batch system, and you are talking pretty much about two systems actually, not one. But it gives you, let’s say, the flexibility to implement both use cases, both for the streaming data, the real-time data, let’s say, and for the batch data. Now, when it comes to the batch side of things, at that time, if I was adventurous, it would probably be Spark. I think Spark was released in 2009, or something like that. Spark was young.

Kostas Pardalis 15:25

So that would probably be that. Now in terms of like, the real-time part. If I remember correctly, Kafka was released for the first time in 2011. But I think that … What was the platform from Twitter called? Storm? I think yes. So maybe I would probably use something like Apache Storm for the processing. That’s what I would do back then. Okay, now, what would happen today? That’s another story.

Kostas Pardalis 16:01

But it probably would be something like that, like a lambda architecture with these two main technologies as its pillars, one for the batch processing and one for, like, the stream processing. And of course, we are talking about two separate systems with whatever issues come with using this architecture, right.
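
The lambda architecture Kostas describes can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Spark or Storm code: the class `LambdaStore` and its methods `ingest`, `run_batch`, and `query` are hypothetical names, and an in-memory list stands in for the event log. It only shows the shape of the idea, a batch view recomputed from the full log, a speed view updated per event, and a read path that merges the two.

```python
from collections import defaultdict


class LambdaStore:
    """Toy lambda architecture: a batch view plus a real-time (speed) view."""

    def __init__(self):
        self.master_log = []                # append-only event log
        self.batch_view = defaultdict(int)  # rebuilt from scratch by run_batch()
        self.speed_view = defaultdict(int)  # covers events since the last batch run

    def ingest(self, key, amount):
        # Every event is appended to the log and applied to the speed layer.
        self.master_log.append((key, amount))
        self.speed_view[key] += amount

    def run_batch(self):
        # Batch layer: recompute the whole view from the full log,
        # then discard the speed-layer state that the new batch view covers.
        view = defaultdict(int)
        for key, amount in self.master_log:
            view[key] += amount
        self.batch_view = view
        self.speed_view = defaultdict(int)

    def query(self, key):
        # Serving layer: merge the batch and speed views at read time.
        return self.batch_view.get(key, 0) + self.speed_view.get(key, 0)


store = LambdaStore()
store.ingest("AAPL", 100)
store.ingest("AAPL", -40)
store.run_batch()            # batch view now holds 60
store.ingest("AAPL", 25)     # only the speed view sees this event
print(store.query("AAPL"))   # 85
```

Even in this toy form, the same aggregation logic appears twice, once in `ingest` and once in `run_batch`, which is one of the issues that comes with running two separate systems.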

Pete Goddard 16:17

You hit on actually one of the phrases that we’ll use sometimes, and that is lambda architecture. Yeah, that is very much what we delivered. The short answer of our experience in 2012, was that we looked out in the marketplace, and we were thinking, I’m the CEO of a trading company, I’m thinking, what vendor can provide this? Because I don’t want to build this from scratch; I want it now, not in a couple of years or something like that. And I don’t want to be stitching together a bunch of different technologies, that was not something that I was versed in at the time.

Pete Goddard 16:48

So we just found none. The reality is we surveyed the marketplace, and we just found no solution. Some of the technologies you talked about seemed relevant, certainly, but no solution. And one thing to remember is when you talk about Kafka, at least certainly at that time, and largely, Kafka is used as a phrase in a number of ways. A lot of time, when you’re talking about Kafka, you’re talking about the pub-sub system, not the data engine, right, that there are data engines on top of Kafka, but that’s not identical, and Confluent has one, but it is not identically what we think about when we think of Kafka. So the short answer is we rolled our own. We built this system, with a lambda architecture; we had a couple of breakthroughs. We think one was an update model, which allowed for incremental calculations rather than calculations on whole datasets. That is very, very empowering. Either from a compute perspective, or as you think of use cases that have complexity and a lot of pipeline logic, being able to incrementally see new data, and then just do small computes instead of massive computes is really quite a big deal. So we had a breakthrough in regards to that architecture. And then consistent with that lambda comment you made earlier, we created a unified model for handling historical static batch data, as well as for handling real-time streaming, updating, ticking data, such that users, whether they be people that are just writing table operations, or users that are doing sophisticated things, combining that API with other languages could really be blind to and agnostic about whether they are working on historical data, or real-time data, which is huge for a company that’s trying to have a very quick turnaround from explore, to productionize, to deploy. And that is really what defines a quantitative trading firm, being very good at finding new signals and then getting them out into the marketplace.

Kostas Pardalis 18:59

Yeah, that makes a lot of sense. Let’s go a little bit back. I know that you touched on some of the stuff that I’m going to ask you. But I want to ask these questions first, because I have something very specific that I want to address with your product.

Pete Goddard 19:15

I feel like you’re setting me up. I like it. Let’s go.

Kostas Pardalis 19:17

Yeah. So you mentioned time, like, real-time. So real-time, it’s probably one of the most abused terms, especially in computer engineering and in tech in general, mainly because it’s very context specific, like how you interpret like real-time, right, for example, and this is something that like, let’s ask, for example, Eric, right, like in marketing, what would you consider real-time for you?

Eric Dodds 19:47

That’s a great question. Not too complicated, but I think it differs with B2B and B2C. I mean, in B2C, anything close to, well, I was talking with a company the other day, and they went from six hours to 15 minutes for real-time ecommerce analytics reporting on A/B tests for conversion rate optimization. And for them, that’s real-time, right? And I think from that standpoint it really is because you can’t really act on the data faster than that, right? You can’t look at results from a test and then deploy the next test in less than 15 minutes. And then I would say in a B2B context, I mean, for a lot of companies, getting your data updated daily is plenty sufficient, if you think about things like leads or deals going through the pipeline. So yeah, the sub 15 minute.

Eric Dodds 20:39

And of course, there are other use cases, right? Like there may be some recommendations engines that are running based on data science models for delivering a user experience where I would say that really that has to be extremely like the lowest latency possible. But that’s more about infrastructure delivering experience and less around, like marketing analytics data. Maybe that was more than you were looking for. But there’s my side.

Pete Goddard 21:02

I think that’s great. Because this is really where the rubber meets the road in regards to people like you talking to people like my team and me. In that, it’s really interesting to be able to talk about data infrastructure where we come at the problem set from two different perspectives, two different histories, really, but I think where I would hope we can get to by the end is perhaps we might both discover that there’s a new future where we become quite close to one another. And it’s a future that presents both a lot of challenges but also a lot of opportunity to companies like RudderStack with companies like Deephaven as well. And I think we can talk about that some more. But Kostas, you were about to ask a question, I didn’t want to get in front of that. Please.

Kostas Pardalis 21:57

So yeah, actually what I was trying to do is like, get different interpretations of what real-time means.

Pete Goddard 22:05

It’s a great question. And I’ll answer the real-time financial markets in just one moment. But I get this question now that I’m starting to talk to people outside of the capital markets. Their instinct is a little like yours, like, I’m not sure what real-time means or how relevant it is. But when we think of real-time data, the first thing we think about is not latency; the first thing we think about is the fact that data is changing. We see data as a thing in flux. And there, Eric, I imagine you would agree, if you just put a pin in when do people need it. If I say, is data the same, or is data changing? You’re probably content to say that it is changing. So then I will just ask a fundamental question about your architecture. And that is, would you be open to an idea that perhaps your architectures should be designed with that fundamental in mind? Or do we have to insist that the future of data, even for lower latency, has to be that everything is a picture in time, everything is static data, where you have to look at only the whole universe of data all at once? We think, fundamentally, real-time data is first about just saying data is in a state of flux, and you need to architect your system accordingly. And if you really believe that, you’ve really narrowed your options quite a bit, because most systems are not organized that way. We think Deephaven is organized neatly that way.

Eric Dodds 23:42

And I think that’s one of those things that as you … Kostas, we keep delaying your question, I’ll stop after this, but when you ask that question, it makes logical sense. But for me, in my experience, it’s one of those situations where if you’re not asked the question directly, it’s hard to imagine that world, if that makes sense. Just because the world that most people I think have lived in when you think about real-time is just related to latency, right? That’s how you interact with, consume, plan, all that sort of stuff. And so it’s one of those things where it’s like, oh, yeah, I guess I need to think about that. And even some ideas are going through my head around, wow, like, I probably approach some things pretty differently if that were the case, right? If latency weren’t the first thing I thought about.

Kostas Pardalis 24:40

And I think, Eric, now that we are talking about this, I remember at some point we were discussing this on another episode, where we ended up with the conclusion that batch processing, at the end, is just an approximation of a stream of data, right? And at the end, that holds for everything when it comes to data, exactly because you have this dimension of time, right? Like everything is a snapshot in time.

Pete Goddard 25:04

Yes. And I think you’re really starting to speak my language here, Kostas, because we actually don’t only think about this idea that data is changing, but we actually think time actually really matters. And we think this dichotomy between relational use cases and time series use cases is a false dichotomy. We think there’s all these dichotomies that will separate the RudderStacks of the world from the Deephavens of the world. And we don’t think that they’re right. We don’t think data scientists and developers are different from one another; we think that it’s a spectrum, and there’s not two camps. That they have a lot in common. We don’t think applications and analytics are different. We think it’s a spectrum. And we architect systems, so that we’re really covering this whole spectrum, this whole, arguably continuum in these different dimensions.

Pete Goddard 25:56

And when it comes to real-time data, to answer your question, in the capital markets, it’s as weird of a word, as weird of a question, as it is between capital markets and other places, because high frequency lives in single nine turnaround times. And jitter within systems matters as much as performance of systems, right? And they’ll use that as real-time. Whereas some portfolio manager, let’s say an algorithmic trader, he’s trading in real-time, and he’s at 10 millisecond latency, just with his signals. And you’ll have an asset manager who rebalances his book based on factors in real-time, and he’s really doing it on a 15 minute cycle. So there’s nothing sacred here. I think we’re respectful of this term meaning lots of things to different people. But we think with anything below a few milliseconds, you’re really not in a general purpose data system. And anything above that, we think you should be able to cover in a general purpose data system, and Deephaven certainly lives within that space, all the way through any sort of static or historical data engine workload that you can imagine.

Kostas Pardalis 27:15

Yeah, I love that. Because at the end, you pretty much gave a quite similar answer, but it’s one from their own domain and at the end, so when it comes to real-time, the answer is, it depends.

Pete Goddard 27:30

Yeah, and here is what’s interesting, and I think this is maybe something I’d like to put a bow around, right? I’m guessing, I know very little about this world, right? I’m not at all an accomplished computer scientist the way you are. So I know very little about your history and your workflows, but I expect you come mostly from an OLTP world originally, right? Where, when you think of databases, at least over the longevity of your career, transactions are really fundamental to what you think about, right? And so it used to be just OLTP, then, oh, we need to do analytics on this, we need a second system that can do that, right. And now that’s evolved into all sorts of alternatives. But now, as you suggested, Kafka became a thing, right? This idea of event streams, and maybe not even just Kafka or, you know, its competitors. But think of, you know, APIs to vendors or different things you want to do to scrape the world, or now there’s Twitter feeds, oh, there’s sensors on windmills that are out there. There’s all sorts of IoT. I want to analyze what the different telemetry from my Apple Watch is telling me about the health of me as an individual, or maybe healthcare providers are consuming a lot of those telemetric feeds from a variety of people and aggregating them and denormalizing them and trying to create baselines for different people.

Pete Goddard 29:03

Well, now we’re moving into a pretty interesting space, right? We’re moving into a space where classic transactional workloads, as they were handled earlier, are not at the center, where maybe that is not the fat part of the future, when you think of the Gaussian of this distribution in, say, 2030. It certainly has a role. There’s certainly some implementations where that ACID is at the core of what’s important about this data. But now there’s many, many more use cases that all of a sudden become interesting. Kafka suggests that they’re pretty good at transactional load. Certainly a lot of transactional loads that are local don’t need as much heavy OLTP. And now you’re in an environment, which is always what we thought from the get go, which is, hey, I want data in one place. I want it to meet software. I want that software to certainly be great at table operations, but I also want that software just to be Python, just to be Java, just to be C++, just to name your language, compiled down with the table operations to be performed. And that’s how we think of the world. And that’s where we think the world is going. And we built a system that is relevant, and we think serves a lot of those use cases today, and is certainly skating in the direction of the puck.

Kostas Pardalis 30:22

Yeah, yeah, absolutely. I mean, I think this whole approach of, as you say, like moving a little bit away from the transaction model with the database systems and start thinking of the data together with, like, the time dimension there. It’s something that it’s not just about data, it’s pretty much about anything. You can see it even in software architecture, right? You start listening about event driven architectures, of launching arrays, about micro services, and about data that is immutable, and you consume it, and you can replicate the whole history of what has happened, like with the service. So the main core ideas, they pretty much exist everywhere, and we see them being applied almost everywhere. And then, like super interesting, like for someone who is trying to not focus only on one side of technology, because okay, like, obviously, each one of us is hyper focusing on something and we see like the volume from the lenses of our work, but it’s pretty much everywhere. It’s very interesting. We had some episodes, when we were talking with software engineers who have pretty much nothing to do with data engineering or with data platforms. And they were describing how they architect their products. And so you would hear like similar terms used: events, event driven, pub-sub immutability, all these terms that we hear a lot about, like delivery semantics, like all that stuff that we hear from someone who worked like with Kafka like with data systems, you would hear those from people who are architecting actual, like, infrastructure for a product, which I found, like, super, super interesting. That’s one thing. The other thing that I really found interesting what you said, Pete, is about, everyone should have access to that, right? It doesn’t matter if you are like a data engineer, or if you are like a data scientist, if you are a software engineer, you need to have access to the data, right? 
And you need to have access to this data using your own tools and environment. And that’s very interesting, because I remember I had a conversation with a customer a couple of days ago. And they were saying that they are trying to unify the platform that the data science team is using and the data analytics team is using, because they are using two completely different systems. Right? The data analysts, they work on a data warehouse like Snowflake, and then you have the data scientists who have a data lake and they’re using Spark. And even these two types of people that are very close, both of them work with pretty much the same data. And even today, if you go inside organizations, you will see them being completely siloed from each other, which is crazy.

Pete Goddard 33:10

It’s interesting, because a few weeks ago, you guys had somebody on who was kind of an expert, and he was talking about Snowflake versus Spark. And I think you guys did a pretty interesting comparison of those two right there. Certainly, it’s impossible in this industry not to consider them heavyweights. And yet you look at what either one of them is able to provide and we do not see them as sufficiently unifying, I can tell you that emphatically. And an organization can be a company. But an organization could also mean a community, like I think of public policy here, and how compelling it could be, or even things like trendy things. I talk to my kids about sports and data around sports and how interesting it could be to coalesce people around platforms where lots of people can get the data. In the business use case, we know how businesses work, right? It historically used to be like, you’d have DBAs, you’d have developers, you’d have quants, you know, or data scientists, those two used to be sort of the same thing. Now there’s something different, and then you had business people, whether it be analysts, or managers, or executives or sales people. We know that in 2021, every one of those people cares about data. In a lot of companies, most companies, most white-collar roles will have an interest in data and we think it is exciting. And soon it will be more than exciting. It will be a requirement to be able to coalesce all of those people around a single platform and not just to be like, oh, it’s there, but to give them tools that they love. So the data scientist needs to be able to use Python, they need to be able to use all those libraries, it needs to be empowering for them, right? And yet at the other extreme, you need to have a non-technical person that can write a functional script, can create dashboards for themselves that they can share out, can even launch applications with only a few clicks of a button, right? 
All of these are very modern ideas. And now they’re somewhat penetrating the industry, and many people are working on them. From our point of view, Deephaven really delivers them out of the box today, these things just work. So these are not new ideas. These are things that our customers very much rely on and things that we think our customers use to get them ahead of their competition, and to really deliver alpha and differentiators to their business.

Kostas Pardalis 35:48

Yeah and I have to say that one of the things that I really love when I visit the Deephaven webpage is that I can see the Pandas logo next to gRPC.

Pete Goddard 36:02

So you’re talking about, you’re talking about things I love now. So Kostas, should we get into the meat a little bit?

Kostas Pardalis 36:09

Yeah, let’s do that.

Pete Goddard 36:11

Okay. So one of the things that I think is important, and I’ll try and segue here, right, so one of the things that I think is important to understand is that, though the concepts we’ve been talking about today are being more and more embraced in the marketplace and in the development community, many of them are trying to apply them to systems that were not designed accordingly. Okay. So just saying, oh, we want to have an update, we want to get non-technical people, clients and developers around the same platform. Okay. For a lot of systems, it’s going to be very hard to do that, because they have ANSI SQL and transactions at the core. Right? So it’s going to be hard for a non-technical person who’s already said, I’m not learning SQL. How are you going to force them to do that? For a machine learning person or a data scientist who wants to use their Python models, and compile them with the table operations and bring the code to the data to service complex use cases, how are you going to do that in a client-server model, right? So the architecture matters, the infrastructure you’ve built really, really matters. And so we’ve embraced these from the ground up. We built something new from the ground up to service all of these use cases, both for static loads and for real-time.

Pete Goddard 37:36

And one of the things that we think is really important is moving data around, both within a cluster that’s running Deephaven, but also agnostically, to other systems. So you highlighted a couple of the trendy but also, we think, really good open source projects that exist out there with gRPC and Apache Arrow. Right. So in particular, we’ve put a full embrace around these for all of the communication workloads that I just mentioned a moment ago. But here’s something to note, Kostas. And that is that they really are organized, or Arrow in particular is organized, for static data, for batch data, right. And yet, we’ve talked on and off here in the last many minutes about the importance of real-time data or the importance of dynamic and changing data. Okay, so what we’ve done is we’ve written an extension of Arrow, specifically Arrow Flight, that goes across the network and that will support moving this type of data between applications and between nodes in an agnostic way. And in particular, for Deephaven, it allows the data engine to consume it across the network and do the type of computational workloads I’ve described so far. So we think the world is coalescing around these technologies. We think Python really showed you something, right, that if you have a technology that is good enough at a lot of things, and is easy to use, a lot of people will jump in. And I think that’s certainly something that I’ve observed here in the last 10 years. And we think there are communities that are really forward-looking around streams and real-time data and the technologies around there, around data science and the data transportation and in-memory and on-disk formats for that. And that’s really a part of the conversation that Deephaven is trying to enter.

Kostas Pardalis 39:44

Yeah, and so now you’re touching on technology that I’m a big fan of and I’m looking forward to seeing what kind of impact it will have because I think it will have a big impact and that’s Arrow, but I think that’s a discussion for another episode. Yeah. Okay, so a bit more about the technology itself, right? Can you share with us, let’s say the, like the three, four I don’t know how many they are, basic principles that the technology is built on. And that’s also differentiated compared to other technologies out there?

Pete Goddard 40:23

Sure, sure. So I think the first one I have already spoken of, and I’ll just reiterate it, and that’s that we have an incremental update model, which means that we’re doing a lot of work; there’s a lot of data structuring in the background in the system that’s saying, okay, data just came in, what does that mean to the state of the objects that we’re keeping, right. Fundamental to all this is that Deephaven thinks about tables as a very important thing. I mean, by everything you’ve said, tables feel very natural to you, and important to you. They’re very naturally important to us; we think of tables like data frames, right? However, though we embrace tables, we think of streams just as table updates. So anything in Deephaven, we’ve unified this construct so that anything you can do on a stream, you can do on a table, and vice versa. And you don’t really have to be privy to that. There’s data coming in to node sources, and you’re doing stuff with it. Whether that data is real-time or historical, whether it’s a stream or a table, classically, as others would think about it, you in Deephaven get to remove yourself from that duality, and just operate on your alpha.
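
To make the "streams are just table updates" idea concrete, here is a minimal Python sketch of an incrementally maintained aggregation. It is only an illustration under stated assumptions: the class `IncrementalSumBy` and its method `on_rows_added` are hypothetical names invented for this example and are not part of Deephaven's API. The point it tries to show is that the same operation gives the same answer whether the rows arrive as one static batch or as a sequence of small updates.

```python
from collections import defaultdict


class IncrementalSumBy:
    """Keeps a per-key running sum that is updated as rows arrive.

    Hypothetical illustration only; this is not Deephaven's API.
    """

    def __init__(self):
        self.totals = defaultdict(float)

    def on_rows_added(self, rows):
        # Only the newly arrived rows are touched; earlier rows are never rescanned.
        for key, value in rows:
            self.totals[key] += value
        return dict(self.totals)


# A "static table" is treated as one big update.
static = IncrementalSumBy()
static_result = static.on_rows_added([("AAPL", 10.0), ("MSFT", 5.0), ("AAPL", 2.5)])

# A "stream" is the same rows arriving as several small updates.
streaming = IncrementalSumBy()
streaming.on_rows_added([("AAPL", 10.0)])
streaming.on_rows_added([("MSFT", 5.0)])
streaming_result = streaming.on_rows_added([("AAPL", 2.5)])

print(static_result == streaming_result)
```

The sketch only handles appended rows; a real engine would presumably also need to handle modified and removed rows, which is left out here.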

Pete Goddard 41:34

At its core, Deephaven is a Java engine, but we’ve bound it tightly with CPython and NumPy, through a bridge. The bridge is an open source project called jpy that we’re working to support out there that provides, you know, a bi-directional bridge between Python and Java, which means that you can deliver Python to Deephaven servers. And it just works, okay. And bringing this code to data really enables some of the complex use cases I talked about before, where you have a table operation where you’re doing joins, and aggregations, and filters and sorts and all the things that you would typically think of, and you’re decorating the data with new columns, but you’re also delivering bespoke functions and third-party libraries. And all that is getting compiled down to Java, whether it was brought as Java or C++ through the JNI, or Python through C, or through this jpy bridge. And at its lowest level Deephaven is array-oriented, such that this idea of moving between languages is cheap, because we amortize the cost, and performance is great, because we don’t work record by record, we work array by array. It’s all a vectorized process, essentially. So these are some of the fundamentals of our design and our architecture and the system that is out there today on GitHub, as well as, obviously, the gRPC API we mentioned before.
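
The amortization argument can be sketched in a few lines of plain Python. This is an assumption-laden illustration, not how jpy or Deephaven is implemented: the `Bridge` class is a hypothetical stand-in for a language boundary that simply counts how many times it is crossed, and the chunk size of 2,048 is an arbitrary choice made for the example.

```python
class Bridge:
    """Stand-in for a language boundary; counts how often it is crossed.

    Hypothetical illustration only; this is not jpy or Deephaven code.
    """

    def __init__(self, func):
        self.func = func
        self.crossings = 0

    def call(self, arg):
        self.crossings += 1
        return self.func(arg)


prices = [float(i) for i in range(10_000)]

# Record by record: one boundary crossing per row.
per_row = Bridge(lambda price: price * 1.01)
row_result = [per_row.call(p) for p in prices]

# Array by array: one boundary crossing per chunk of rows.
CHUNK = 2_048  # arbitrary chunk size chosen for this sketch
per_chunk = Bridge(lambda chunk: [price * 1.01 for price in chunk])
chunk_result = []
for start in range(0, len(prices), CHUNK):
    chunk_result.extend(per_chunk.call(prices[start:start + CHUNK]))

print(per_row.crossings, per_chunk.crossings)  # 10000 5
```

Both paths compute the same values; the only difference is how many times the fixed per-call overhead is paid, which is the cost that array-at-a-time processing spreads across many rows.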

Kostas Pardalis 43:14

That’s super interesting. And a follow-up question that has to do with how you relate or compare with two specific technologies. One is Kafka. And you operate very well with both. At the same time, with Kafka, at some point, at Confluent, they tried to introduce some primitives around tables, right? So you have, like, the traditional concept of an immutable log that you can build the stream on top of. And that’s like the main thing with the topics. We also have the KTables and all that stuff that they introduced. So that’s one thing: how do you compare to that and, like, what are the differences there? And another question is, we had the chance to talk with the CEO of Materialize a few weeks ago. And again, there we have the interesting case of having a table that can be updated in real-time, let’s say by feeding in a stream. And this happens through a technology that is called Timely Dataflow, right? So, what are the similarities and what are the differences between these two?

Pete Goddard 44:22

Sure, so when we think of Kafka, again, I try to mentally divorce Kafka, the transport pub-sub system, from, arguably, ksqlDB or some other Confluent apparatus that is doing something on top of such a stream, right. So the contrasts are significant. I think, most importantly, in regards to the real-time data, Deephaven very, very much excels at joins in ways that ksql really doesn’t, right. So stream-stream versus stream-table versus table-table joins are different things. If you’re to have stream-stream, then you need to have windowing functions, right? That’s because they don’t have this incremental update model, which means that, hey, I’m going to join two streams, you need to tell me how much of the streams I’m supposed to look at, and then I’m going to do some batch joins, right? We think that that is a different model, and one that many of our Deephaven users, for many of their use cases, wouldn’t be very happy about.
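
As a rough sketch of what joining two streams without a window could look like, the following plain-Python example keeps both sides indexed by key and emits matches whenever a row arrives on either side. The class `IncrementalJoin` and its methods are hypothetical names for this illustration; the transcript does not describe Deephaven's actual join implementation, and this sketch should not be read as one.

```python
from collections import defaultdict


class IncrementalJoin:
    """Incrementally maintained inner join on a key.

    Hypothetical illustration only; this is not Deephaven's API.
    """

    def __init__(self):
        self.left = defaultdict(list)
        self.right = defaultdict(list)
        self.result = []

    def add_left(self, key, value):
        self.left[key].append(value)
        # Match the new left row against every right row seen so far.
        new_rows = [(key, value, r) for r in self.right[key]]
        self.result.extend(new_rows)
        return new_rows

    def add_right(self, key, value):
        self.right[key].append(value)
        # Match the new right row against every left row seen so far.
        new_rows = [(key, l, value) for l in self.left[key]]
        self.result.extend(new_rows)
        return new_rows


join = IncrementalJoin()
join.add_left("AAPL", "trade-1")
join.add_right("MSFT", "quote-1")
# This row arrives after unrelated traffic and still finds its match,
# because no time window limits what the join can see.
late = join.add_right("AAPL", "quote-2")
print(late)
```

The trade-off is that the join state grows with the data, since nothing is expired; a windowed join bounds that state by asking the user how much of each stream to consider.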

Pete Goddard 45:23

There are other significant differences, like we have ordering as a very important concept, which is one of the reasons that Deephaven can present itself today as a very compelling time series database. Right? Kafka has no such thing. You and I just spent some time, or at least you were listening to me as I spent some time, talking about the value of bringing code to data, and how that enables some of the very important use cases. Again, that doesn’t exist in Kafka. And then I think the last thing is, there’s a very, very interesting idea where there’s all this data in the world, or in my world, let’s say, and what I want to do is I want to join it together, I want to do some stuff, and then I want to just create derived streams, right? In Deephaven, this is a really compelling approach, right? So we have a functional language. Somebody names a table, really, that could be a table, or it could be a stream. You write a little thing, oh, I’m just going to get the data, and I just call that table one. And then on the next line, I just write table two equals table one dot where, and then I’m filtering it, right. And then, if I want to, I could just keep it super simple: table three equals table two dot join with table ABC, and do some decoration and do some aggregation, right? In our case, you’re setting up a tree, is the way I might describe it. But you would know, Kostas, that it’s really a graph, an acyclic graph, right. And the new data is propagating through that graph, right, again, using our incremental update model at each node. And we’re doing this in a very performant and lazy way, so that you can grab intermediate results, or the end result. This is a very lightweight, easy, fast-moving way for a user to take a bunch of, for example, Kafka streams, since you’re asking me to compare to Kafka, and generate derived streams without registering schemas, or doing anything heavyweight, just in a very quick, fluid way. 
And we think that the workflows around that are a meaningful difference, not just what the engine can do, but how quickly a person can move. And in that type of accelerated capability, it really is important to business.
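
The "table two equals table one dot where" workflow described above can be sketched as a small graph of nodes that push new rows downstream. This is a toy model in plain Python, not Deephaven's query language: `Node`, `derive`, `where`, `update`, and `push` are hypothetical names for this example, it handles appended rows only, and it propagates eagerly rather than lazily.

```python
class Node:
    """One table in the graph; pushes each batch of new rows to its children.

    Hypothetical illustration only; this is not Deephaven's API.
    """

    def __init__(self, transform=None):
        self.transform = transform or (lambda rows: rows)
        self.rows = []
        self.children = []

    def derive(self, transform):
        child = Node(transform)
        self.children.append(child)
        return child

    def where(self, predicate):
        # Derived table that keeps only rows matching the predicate.
        return self.derive(lambda rows: [r for r in rows if predicate(r)])

    def update(self, fn):
        # Derived table that decorates each row, e.g. with a new column.
        return self.derive(lambda rows: [fn(r) for r in rows])

    def push(self, new_rows):
        # Apply this node's transform to the new rows only, then propagate.
        delta = self.transform(new_rows)
        self.rows.extend(delta)
        if delta:
            for child in self.children:
                child.push(delta)


table1 = Node()
table2 = table1.where(lambda r: r["price"] > 100)
table3 = table2.update(lambda r: {**r, "notional": r["price"] * r["size"]})

table1.push([{"sym": "AAPL", "price": 150.0, "size": 10}])
table1.push([{"sym": "XYZ", "price": 5.0, "size": 1}])
print(len(table1.rows), len(table2.rows), table3.rows)
```

Every intermediate node keeps its own rows, so an intermediate result such as `table2` can be inspected at any time, and no schema is declared up front; the derived tables simply follow from the operations that define them.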

Eric Dodds 47:50

I love the arc of this conversation, because it touches on a theme that we’ve brought up on the show multiple times. And that is imagining this future world where the technical limitations around data are gone. And batch to streaming is one example of that, where data is happening in real-time in real life all the time. And batch actually is just a technical limitation, right? It was developed because it’s really hard to move data in real-time, right? Or historically has been in many ways. And it goes to what we were talking about with latency, right? Like people think about analytics, or even data and in terms of latency, just because that’s, that’s been our entire experience with data. And so I just love the way that you talk about Deephaven in a way that you’re thinking about that future, right, and trying to break down some of those limitations that have traditionally existed, which creates the opportunity for some new use cases.

Pete Goddard 48:51

It’s nice of you to say. It’s been really interesting for me because you have to remember I’m not actually qualified in the way that you all at RudderStack are. I focused primarily on capital markets stuff for a long time. And then we created this piece of technology, because we had business needs. If you look at some of the other systems that have been developed, they, you know, grew out of smart engineers that left Oracle or people that were grounded in Cockroach, and then they wanted to evolve the world. We came at it from, hey, we’re business users, what do we want our systems to do? And so we built a system from scratch. And that really isn’t my point. My point is, what’s interesting to me as an observer of all this is that, particularly over the last three or four years since we spun this out, and to be fair, we rebuilt it, essentially, to meet the marketplace, a lot of the founding principles that are really important to us are resonating in the community, and other people are building solutions that just map to the way we think of the world and the architecture we’ve put together, such that we’re very inspired by the data science community. We find it compelling that somebody like Wes McKinney has gone from Pandas to Arrow, and has created a framework for data in memory and across the network that really makes sense to us. And we can elaborate on it. We love gRPC, we rely heavily on Envoy, which is a proxy server that was put out by Lyft. We think that this, you know, it’s just so exciting, frankly, as an outsider that kind of wants to be invited to the table, to look at all this, and be fascinated by the fact that our vision is shared by other people. And immodestly stated, hey, guys, we might have done some work that’s going to be helpful here and that we can bring to the picnic in some ways like that. Let’s serve that up, too. So that’s really where we stand. 
And it is an exciting future, because it’s all going to move quickly and in unpredictable ways, as we know.

Eric Dodds 51:07

Yeah. If we think about going back from 2012 to today, it’s such a fun exercise to think about how you would build this in 2012. But if we think about the next decade, it’s going to be wild. One thing I want to do, though, is, the product is so interesting, and there are tons of use cases that I know have come to mind for a lot of our listeners. But I’d like to drive home what it looks like to use Deephaven on a practical day-to-day basis. I know you came from the financial markets, but a lot of your customers are outside of the financial markets. And so could you just give us a couple of examples of some cool things that your customers are doing with the product?

Pete Goddard 51:52

Yeah, sure. I mean, so I think there are some industries that are ahead of other industries in regards to really caring about streams and pub-sub systems and are dominated by that. And those are the ones that are, for the most part, first movers with us. So capital markets, certainly. Think crypto, but also not crypto, but blockchain, right. There’s all sorts of data on the blockchain. I don’t want to just build analytics, I want to build applications off of that. Sure. IoT, telemetry, gaming, energy and power, these types of things. So I think Deephaven is this core engine, but it’s also the framework. I talked about all those experiences and integration. So some of the classics, some of what people will typically use us for in the early going is, as I suggested earlier, hey, I have a lot of Kafka streams, I want to marry them to some CSVs, or to some Parquet, I want to do that in an interactive way, where I’m interrogating and exploring the data. We have a really, really easy to use, click a button and go console experience, or what we call a code studio, what others might call a REPL.

Pete Goddard 53:04

So people will investigate data. Like if you want to look at real-time data in a browser, you have a Kafka stream, I want to see it, or oh, I have a Parquet table and it has 2 billion, 3 billion records, I want to see it in a browser. If you want to do either of those two things, there is not an option. The only option you have is Deephaven. And within a few minutes, you can be going and seeing that stuff, seeing that stuff and touching it, and interactively doing all this stuff. You think filter, plot right there from the UI, right? Not to mention, now all of a sudden, you’re also in an environment where, okay, I want to do more sophisticated stuff, like I want to join it, I want to aggregate it, I want to create new data here with different decorations. So bringing those types of things is classic.

Pete Goddard 53:54

People do data science, right? So, oh, I want to bring this data, and then I want to marry it to PyTorch or something like that, or I want to do even just statistical modeling on this data. Particularly as it relates to data that is changing, as I’ve suggested, that’s a hard problem that for the most part we make easy.

Pete Goddard 54:13

And then the last thing is, oftentimes, a lot of what our users will do is they’ll say, oh, I have a non-classic source of data. Oh, I’m gonna scrape a website. You know, here’s one: I’m trying to get a sentiment indicator for earnings calls, is what a capital markets customer did, for example. I’m going to scrape a website, I’m going to essentially inherit in real-time the conference call of the CEOs of all these companies. Right after they have these earnings, their leadership team has the earnings call, and I’m going to inherit that transcript. I’m going to parse that transcript through classic Python libraries and I’m going to establish a sentiment indicator that then I can process as a signal that combines with other signals I have to tell me whether I should buy or sell something. And I’m going to do that in real-time as they’re talking. That is a classic model, and you can do all of that in Deephaven. With anything else, you could do something like that, but one, it would be delayed; two, you would have to use a client-server model where you’re doing some table operations in one place and you’re pushing data around and copying it, you’re moving it, right, you’re transforming it. We very much just allow people to, hey, just push that code, it’s all going to get compiled together, you’re going to get this answer. You’re going to scrape the website right there, you’re going to handle the objects that are the data right there. You can deliver the sophisticated Python libraries right there.

Eric Dodds 55:44

Yeah. Can I ask you a question that straddles the fence between entrepreneurship, like SaaS entrepreneurship and technology?

Pete Goddard 55:54

Please. I’m not sure I’m qualified, but …

Eric Dodds 55:58

I don’t think I’m qualified to ask this. So I think we’re both entering uncharted territory. I’m sure there are a lot out there. So this is anecdotal. But when you think about SaaS focused on data, a lot of it’s venture backed, which kind of makes sense, right? I mean, especially when you think about database products, which take years to sort of mature and come to market and become generally available. And it’s challenging stuff, right. And so it makes sense to capitalize the business according to the go-to-market timeline. And it’s rare. One, I would think that it’s pretty rare, at least in the examples that I’m thinking of, for a company to spin SaaS out of another business, and for that to be a truly robust tool. And then I don’t know why this example came to mind, but it’s like the glut of project management tools, like a team is like, oh, we can’t find a project management tool that works, so we’re gonna build our own. And it’s like, yeah, the reality is, they all have their limitations, and you just have to figure out how to deal with it. Right? And I’m sure there’s some success story of that happening well, but many, many more failures, right.

Eric Dodds 56:37

And so I was thinking about Deephaven and if we compare it to the project management type thing, and the problem is very hard, right? I mean, real-time data in capital markets is sort of the tip of the spear when it comes to complexity, quantity, requirements for real-time computation, variety of needs on processing or running your own custom code on the data. And so I’m interested to know just for you as an entrepreneur, do you think the difficulty of the problem contributed to the success of spinning a data SaaS product out of another company, which I think in itself, I need to say congratulations, because I think that’s quite a feat.

Pete Goddard 58:02

OK, well, there’s a lot there. Let me try to unpack it. So first, I would say that I appreciate the congratulations, our team would suggest that it is premature, and we have a lot of work to do in servicing customers. So we hope to earn that congratulations over time, but I very much appreciate it. The second thing I would say is in regards to real-time and capital markets, I think when we spun it out, we believed quite heartily, as you suggest that we were at the tip of the spear; we were handling, quite reasonably with good performance and with extreme flexibility, I would immodestly state, not some use cases, but a combination of use cases, a portfolio of use cases that were challenging, and that we were servicing them for really pretty demanding customers, internal customers, but customers. And we spun it out, because we believed what I told you earlier. And that is, wow, this is a really interesting problem set. And we think the world is going there, like we think that feeds were just the capital markets term. And now the whole world thinks about feeds, a Twitter feed is like a known thing. That’s data. Everyone knows there’s data and there’s money to be made. There’s problems to be solved. There’s questions to be asked of all sorts of feeds. So oh, the world’s moving to feeds and Kafka became a thing. So we were like, oh my gosh, this is really interesting that we felt expert at a world that was about to grow, and everything we’ve seen in the last several years since we’ve spun it out. And since we meaningfully re-architected the platform seems to reinforce that view. So that was very exciting.

Eric Dodds 59:58

Yeah, no, that’s super interesting. Again harkening back to 2012 and thinking about how common the word “feed” was right? It is all over the place now.

Pete Goddard 1:00:11

So when we spun out in late 2016, I think you made a very inside baseball observation that yeah, most stuff like this is championed by venture capitalists, and largely on the coasts. And I think we were lucky, in that we had a series of investors that had seen the power of Deephaven to really revolutionize an organization and to allow an organization to move quickly. And then we were lucky in finding a number of early customers, not a huge number, but a number of sophisticated customers who pointed their development and data science teams at us, and were greedy. What that means is they said, we need the platform to move in this direction, we need these features.

Pete Goddard 1:01:04

So a lot of the ideas that I’ve expressed to you today are not even mine, certainly not mine, not even my team’s, but rather a reaction to a pretty small customer base, but a pretty sophisticated customer base that really had lots of options in regards to technologies they could choose, but were choosing ours and asking it to evolve. So I think, again, as we think of the fat part of the curve, the belly of the bell curve in regards to data workloads, we think that real-time data is going to matter there. We think that coalescing a lot of different types of people around a platform is going to matter there. We think bringing code to the data, so you can handle complex use cases at the same time as you have real-time use cases; we think handling relational and time series stuff together to deliver analytics and machine learning on the same platform as applications, right? We think that that’s where it’s all going. And we’ve been fortunate to have a group of people, as investors, that have believed in that, and a series of customers that have been very involved in trying to move us to the promised land.

Eric Dodds 1:02:21

So cool. Well, we are close to the buzzer here. But before we hop off, Pete, you’re open source, which we didn’t have time to talk about, sadly, because that’s something you know, I’m very passionate about. I know Kostas is as well. But if anyone in the audience wants to try Deephaven or explore what’s the best way to do that?

Pete Goddard 1:02:43

Sure. So you could go to our GitHub, which is just Deephaven on GitHub. There are, we hope, simple instructions for you to download our images and launch an instance, and tutorials there that will hopefully show you or introduce you to the range of concepts that I’ve described here. We’ve both invested reasonably and are very committed to trying to provide good support, and are pretty dogmatic about it. We have lots of fights internally about whether something is easy enough to use, so we’re trying to be vigilant about that, and we are very much accustomed to supporting customers. And we see all of these community users as customers of ours and are dedicated to not just supporting their use cases, but really listening to where they think the product should move.

Eric Dodds 1:03:42

Pete, this has been an amazing conversation on multiple levels, philosophical, technical, and even a little bit of business thrown in there, which I think makes for an awesome time.

Pete Goddard 1:03:54

Well, it’s been an indulgent experience for me, I learned a lot, and I very much appreciate the time you guys spent with me.

Eric Dodds 1:04:01

Great. Well, we’ll check back in with you in another six months or so and see how things are going.

Pete Goddard 1:04:05

Beautiful. This is great. Thanks, Eric.

Eric Dodds 1:04:07

What a fascinating guy. I love how many of our guests have studied things that are astrophysics-esque. And then Pete’s story is really interesting, because he went from there into the financial markets, which is interesting. I think one of my big takeaways was Pete’s challenge to me in terms of thinking about analytics through the lens of latency and pressing on that and asking why. And I just love that because I think it’s expressive of a mindset that doesn’t accept the currently available tech as status quo, which opens the possibility of imagining what can happen if you break down current barriers. So it just made me think a lot and I’m sure I’ll keep thinking about it the rest of this week. How about you, Kostas?

Kostas Pardalis 1:04:58

Yeah, I think this part of the conversation was super interesting. And I really enjoyed, let’s say, the reframing of the term real-time from being around latency to being around something that changes. Right. I think that was, like, one of the most interesting parts. And the other thing that I would add there is something that has started emerging as a pattern in our episodes, about the importance of streaming data. And that’s something that we discussed today. And it seems that streaming data and data that changes often are becoming more and more important, and we build more and more technology around them. And with what we saw today, together with what we had, like, a couple of weeks ago, when we were discussing Materialize, I think we are going to see more and more technology and interesting products coming out that will be dealing with streaming data and data that changes.

Eric Dodds 1:06:01

Absolutely. Well, join us for upcoming shows to dig more into streaming and meet other cool people working in data.

Eric Dodds 1:06:11

We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at Eric@datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
