This week on The Data Stack Show, Eric and Kostas chat with Ryan Wright, philosopher-CEO at thatDot. During the episode, Ryan discusses all things graph databases, from use cases to scalability and more.
Highlights from this week’s conversation include:
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Eric Dodds 0:03
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.
Kostas, we are going to talk today about graph, which we haven't talked about in quite some time. I think Neo4j was the last one that we talked about. So I love bringing up subjects that we don't cover a whole lot. We're going to talk with Ryan from thatDot; they're the company behind an open-source technology called Quine. And really my first question, since we're talking about graph, is just defining it, and then understanding from Ryan where it fits in the stack. It can be used for a number of different use cases, right? I mean, literally software development and building actual graphs, but also queries and insights and all that sort of stuff. So that's what I'm going to ask. How about you?
Kostas Pardalis 1:10
Yeah, you're right. Like, we haven't had many opportunities to talk about graph databases, and graph databases have been around for quite a while. But we don't hear about them that much outside of, like, you know, Neo4j, which is probably the most recognizable one. So it will be super interesting to hear from Ryan what made him start the project, why we need graph databases, what's new about the system that he has built, and, as you said, how it fits into the rest of the data infrastructure out there, because it sounds like graph databases have been a bit niche, like for analytics, for instance. At the same time, we have stuff like GraphQL, which is not about analyzing data, but in the front-end development space it has been very well adopted. So it will be interesting to hear from Ryan what's next and what's new and exciting about graph databases.
Eric Dodds 2:25
I agree. Well, let's dig in and talk with Ryan. Ryan, welcome to The Data Stack Show. We are so excited to chat with you today.
Ryan Wright 2:33
Thank you, Eric. Great to be here.
Eric Dodds 2:35
All right. Well, you have a super interesting background as a data practitioner and entrepreneur. Tell us about your background, and what led you to thatDot.
Ryan Wright 2:47
Yeah, sure. My career has steered in the direction of data platforms and data science, really as a software engineer creating data pipelines, machine learning tools, and other analysis components to help answer this question: we've got this high-volume data stream, what does it mean? In my career, that has kind of been the arc that has been guiding a lot of my technical work. And that has led me personally through positions as a software engineer, principal software engineer, director of engineering. I've led research projects as well, focused on creating new technologies and new capabilities, so I was a principal investigator on DARPA-funded research projects. And the constant thread through all of that is this data question: here's a bunch of data, what does it mean?
Eric Dodds 3:43
Absolutely. And tell us a little bit about thatDot what is thatDot? And why did you found it?
Ryan Wright 3:50
So thatDot is a young startup that we founded to commercialize a technology called Quine. Quine is the world's first streaming graph. It's an open-source project that was just recently released, and it's been getting great community feedback. And thatDot is the company behind the commercial side, providing commercial support and some extra tools on top of it for managing and scaling it and just running it at large volumes in enterprise environments.
Eric Dodds 4:20
Very cool. And tell us, okay, tell us about Quine. So it’s recently released, where did it come from? And then it’s an interesting name. Before the show, we chatted briefly about this, but tell us about the name, or how the name Quine came about?
Ryan Wright 4:35
Yeah, so from a technical perspective, Quine is a streaming graph that kind of lives in between two worlds: the world of databases and data storage, especially something that looks and feels like a graph database, but it's really aimed at this high-volume data streaming use case. And, like I was describing, those are themes in my own background. When you put those two together, the big question is: here's a high-volume stream, what does it mean? That meaning question ties into my philosophical background. I have a bachelor's degree in philosophy and just could never shake that bug afterward, because there are some really interesting, deep philosophical questions that have really nice tie-ins to modern data problems and modern software engineering. There's this really old question in the history of philosophy about how a word conveys meaning. Not what does it mean, but how does it go about doing that process of conveying meaning? There's a long history behind that question and a lot of deep thought on it, and there's just a really striking parallel to the data question. If you've got a stream of data, and you think about each record in your data stream as a word, and put all those things together, you've got something a whole lot more meaningful. You've got a really nice comparison to this long-running question that has a lot of deep thought behind it. So the name for this project was really the synthesis of trying to say: there's this age-old question about how words convey meaning, and there's this very modern, urgent question about how a stream of data conveys meaning and what it means. If we put those together, we can leverage some thinking on both sides to do something new and really advance and move the ball forward.
Eric Dodds 6:31
Super interesting. I love it. Well, let's actually go back to basics. A lot of our listeners, I know, are familiar with all sorts of data stores, but we actually haven't talked about graph databases on the show a whole lot as a subject. And I know you said that Quine is a technology that sort of looks and feels like a graph database but sits in between two worlds. I want to dig into that. But could you just give us a 101 on graph databases? What are they? What are the unique characteristics? And generally, how do you see them used? Because they're not necessarily new, as you were talking about before the show. So yeah, give us the graph database 101.
Ryan Wright 7:14
Yeah, absolutely. So a graph: if you close your eyes and imagine a whole bunch of circles swimming around, some of them connected with arrows, that's a graph. The circles in the graph are nodes, and the arrows that connect them are edges. It's a way to represent data, especially when, on each of those circles, you can put properties, which are just key-value pairs. So imagine, like, a map or a Python dictionary or something like that on each one of those nodes in the graph. And so what you say about a node is really two things: a node has some collection of properties, and it has relationships to other nodes. When you put that structure together, then you can represent literally everything that you represent in other ways. So data in a relational database, or NoSQL database, or tree store, or key-value store: you can represent exactly that kind of data in that graph. But then the graph gives you a different perspective on it, and it turns out it gives you some superpowers for working with that data. The different perspective comes from the sense that, in a lot of ways, graph data and graph structures feel like there's something different. But in truth, it's more. So you have data in a relational table, you have data in a key-value store, and there's a relationship between those, in that relational tables are a bit more expressive. They're more powerful: you can join tables together, you can run queries across them, you can talk about the relationship between values there. In the same way that there is a relationship between key-value stores and relational tables, there's that same relationship between relational tables and a graph. So they're actually on a progression that gets more and more expressive as you go further down that list. And internally, we have this picture we talk about among our team. We call it the kite.
It's just this kite-shaped relationship between different data stores, where if you start with a single value down at the bottom, you ask this question: when I have a whole bunch of those, what structure do I get? You get a key-value-store type of structure; it's a list or a set, something simple like that. But what do I have when I have a whole bunch of those? Well, then it's a table structure, or it's a tree structure, just depending on how you choose to represent it. Those are equivalent. Okay, but what do I have when I have a whole bunch of those? Well, imagine a bunch of trees that overlap and intersect. That's when you're working with a graph. When one leaf shows up in multiple trees, then you've intersected your trees, and you're working with a graph. So there's this progression through a lot of traditional data storage and data representation technologies that gets more and more expressive as you move up, and at the end of that process, you arrive at a graph. And Quine got started, in part, because we could keep that thinking going and ask that question again: well, what structure do I have if I have more of those? When you get all the way to the end, and you've got a graph, and you say, well, here's this soup of nodes and edges all interconnected together, I can treat that as one thing. When I've got a bunch of those, what do I have? The answer is a graph. When you have a bunch of graphs, you have a graph. So there's this mathematical pattern that walks through the history of database evolution, from simple stores like key-value stores, to relational databases and tree-structured NoSQL data stores, all the way through to graph databases. That mathematical progression through those things gets to the graph at the end, and it stops.
That was just so suggestive to us, we thought: we've got to explore this all the way. And so that was the thing that first got me hooked on graph data and graph structures, alongside working with some real practical problems in various startups at the time. We had configuration problems: it's hard to get the product configured and out the door for all these different customers, where some things they have in common and some things have to be separate. Teasing all that apart, and understanding how to define that customer configuration, really drove in the direction of saying it's got to be a graph; we've got all these trees that are overlapping each other. So teasing apart those practical questions also steered in this direction of a graph. And so that's what initially led to some of the ideas behind Quine, and then we dove in to explore it all the way.
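To make the "kite" progression concrete, here is a rough Python sketch of the idea Ryan walks through: a value, a bunch of values as a key-value store, a bunch of records as a table, and finally overlapping trees intersecting into a graph. The names and structures are purely illustrative, not anything from Quine itself.

```python
# The "kite" progression: each structure is "a bunch of" the previous one.
# All names here are illustrative, not Quine APIs.

# 1. A single value
value = "10.0.0.1"

# 2. A bunch of values -> a key-value store
kv_store = {"ip": "10.0.0.1", "user": "alice"}

# 3. A bunch of key-value records -> a table (or, nested, a tree)
table = [
    {"ip": "10.0.0.1", "user": "alice"},
    {"ip": "10.0.0.1", "user": "bob"},
]

# 4. A bunch of overlapping trees -> a graph: nodes carry properties,
#    edges connect nodes, and a shared leaf (the same IP) intersects the trees.
nodes = {
    "login1": {"user": "alice"},
    "login2": {"user": "bob"},
    "ip:10.0.0.1": {"addr": "10.0.0.1"},
}
edges = [("login1", "ip:10.0.0.1"), ("login2", "ip:10.0.0.1")]

# Both login events now connect to the same IP node: that shared node is
# what turns two separate trees into one graph.
shared = [dst for src, dst in edges]
assert shared[0] == shared[1] == "ip:10.0.0.1"
```

The point of step 4 is that nothing from steps 1 to 3 is lost: each node still holds a key-value map of properties, and each row of the table survives as a node, but the edges add the extra expressiveness.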
Eric Dodds 12:10
Very cool. That's such a helpful analogy. I love the kite analogy, and the spectrum of expressiveness. Now, some of the use cases for graph, I think, are pretty obvious to anyone who's worked with data, right? So an identity graph, where you have nodes and edges, and you're representing relationships and all that sort of stuff. But let's just say we have kind of a standard, "vanilla" data store set up in our stack: you have a warehouse, you're storing structured data in there, and it's driving analytics and all the other stuff that you do with tables and joins, as you mentioned. Let's say you have a data lake, and you have sort of unstructured data storage there, maybe serving data science purposes, etc. Where does graph fit in for someone who's working on a data team? Because what's interesting to me, at least, when you think about graph, is that there are use cases across the stack, right? In software engineering, of course, to represent relationships of users, say, but then also, to your point, analytics and discovering meaning from data as well. So graph can kind of be a utility player in the stack. So help us understand where something like Quine or thatDot would fit.
Ryan Wright 13:40
Yeah. So, like we were talking about before, there's this spectrum of complexity or expressivity, where you move from single-value stores or key-value stores, all the way up through relational and NoSQL, up to graph structure. So there's one spectrum about complexity: a graph helps you answer more complex questions. And what I've seen in the industry, and what really led to Quine's creation, is this other spectrum about data volume, the speed at which data has to get processed. Batch data processing has been the standard forever, and plenty of batch processing still happens. But increasingly, the world is moving more and more towards streaming data problems, which means real time, one record at a time, as fast as you possibly can, and it's never going to stop. It's an infinite stream. So there's that second dimension, the spectrum from batch to streaming. When you put those two things together, you've got simplicity versus complexity, say, on your x-axis, and you've got speed, or batch versus streaming, on your y-axis. What we found is that, of course, what you want is the top right corner there: you want the complex, real-time streaming processing. That's what we want, to be able to answer complex questions, especially analysis questions about: here's my data, what does it mean as it flows through? But there tends to be a pretty natural trade-off. There's this barrier between what we're working with in this realm and what we want to get to in that top corner. So you usually have to trade off how much data you're looking at. How much do you hold in memory? How fast can you store it to disk and pull it back in again? A lot of just classic data processing and data storage questions that usually force architectural decisions into a compromise. So the compromise is either, well, we've got a complex question.
So we're going to use an expressive tool, like a graph database, to be able to describe and draw the complexity of some pattern that is at the heart of our critical use case. But if you want to do it fast, well, you can't really do that; graph databases have been notoriously slow for decades. So they're cool, but too slow. So you get pushed in the direction of, well, let's stuff it into a key-value store, and let's build a big microservice architecture. Let's let that be the graph. So as data flows through that architecture, it basically becomes this graph-structured system of processes, and you try to build this custom, bespoke, one-off-for-my-company data pipeline and make it into the thing that can handle high volumes of streaming data. So all the way at the top with streaming speed, but all the way to the simple side on complexity. And a lot of data engineers spend their life in that world, making those trade-offs. We need it to go faster, we need to scale it, we need it to be horizontally scalable, faster, faster, faster, it never ends. But we need to put all this data together; we have to understand it in context; we have to know how this piece of data relates to that piece of data, and be able to look for patterns that are built up over time as that data streams in. So that's really where graph has fallen short in a lot of ways. For anybody who's working on data pipeline tools, a graph is the right representation for anything that starts to be complex, but graphs have been too slow to actually operationalize and productionize. So they've kind of been relegated to smaller toy problems in this space. But if you've got to build something real and big and significant, then they're just kind of off the table, and you're gonna have to go build this complex microservice architecture to simulate a graph built out of a bunch of key-value stores for speed.
Eric Dodds 18:10
Super interesting. Okay, I’ve been hogging the mic from Kostas, but one more question for me. Could you give us, as a follow-on to that, just a real-life example of a challenge, a real data stream that a company has and a problem they’re trying to solve and how Quine fits into an example architecture?
Ryan Wright 18:38
Yeah, we've been having fraud use cases coming out of our ears lately, because fraud tends to show up very often, very commonly, as some form of a graph problem. And it depends a little bit on what kind of fraud you're talking about. For authentication fraud, logging into a website, there are lots of good tools and good products available for handling sign-in. But sometimes things don't quite go along the happy path. You've got, say, bad guys out in the world who are trying to guess the password, or they have a password list, and so they're going to do this distributed authentication attack. Well, if it comes all in as one big flood and just dumps on you, that's pretty easy: rate limiting has that covered, you've got that problem solved, no problem. But attackers are smart, and they're gonna keep moving and getting smarter. So now they know: slow it down, spread out your attack, have it come from multiple sources in multiple different ways, multiple different regions, lots of IP addresses. Each one of those factors then lets them hide in the deluge of other data that's happening behind the scenes. And to find that kind of authentication fraud, someone who's trying to gain control of, say, a particular account with special privileges, where if they gain control it's a big loss, a big problem, so attackers are motivated: to detect and stop that, you have to start assembling the graph of all these attempts that are happening over time, where they came from, who they're targeting. These attempts fail, fail, fail, fail, fail on this one particular account, and then they succeed. And that pattern shows up in this graph structure.
As you start connecting each of those authentication events together, it forms this often beautiful-looking graph that just leaps off the page as, obviously, here's the pattern of a password-spraying attack, coming from so many different angles, low and slow, so that it's not triggering the easy alarms. But then, when you piece together all those attempts, you can clearly distinguish them from the real user and pull those out. And if they gain access, you can find it, see it, and stop it right away. So that's one example of a fraud use case. We've seen other kinds of fraud use cases as well, like transaction fraud. As money's being spent, financial institutions have to do something similar, considering other factors: what's the geography? What are the other most recent purchases? Who's the signer? What's the category of the purchase? All sorts of different transaction factors that start relying on other values and other kinds of information that isn't so readily usable with a lot of other tools. But when you put it into a graph, it draws this picture, which then leaps off the page and tells a very clear story: that right now, here's a case of someone who is using a stolen credit card number to buy something.
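The password-spraying pattern Ryan describes can be sketched in a few lines of Python: connect every failed attempt to the account it targets, and flag a success that lands after many distinct source IPs have failed against that same account. This is illustrative logic only, not thatDot or Quine code, and the event fields are hypothetical.

```python
from collections import defaultdict

def detect_spray(events, min_sources=3):
    """events: iterable of dicts like {"account", "ip", "ok"} in stream order.

    Returns (account, ip) alerts when a login succeeds after failures
    from at least `min_sources` distinct IPs against that account.
    """
    failed_ips = defaultdict(set)   # account -> distinct IPs that failed
    alerts = []
    for e in events:
        if e["ok"]:
            # A success after many distributed failures "leaps off the page".
            if len(failed_ips[e["account"]]) >= min_sources:
                alerts.append((e["account"], e["ip"]))
        else:
            failed_ips[e["account"]].add(e["ip"])
    return alerts

events = [
    {"account": "admin", "ip": "1.1.1.1", "ok": False},
    {"account": "admin", "ip": "2.2.2.2", "ok": False},
    {"account": "admin", "ip": "3.3.3.3", "ok": False},
    {"account": "admin", "ip": "4.4.4.4", "ok": True},   # attacker gets in
]
print(detect_spray(events))  # [('admin', '4.4.4.4')]
```

Note that no single event here looks suspicious on its own; it's only the connections between events, the shared account node, that reveal the attack, which is exactly the graph-shaped part of the problem.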
Eric Dodds 22:00
Super helpful. All right, Kostas, please. Take it away. I’ve been hogging the mic.
Kostas Pardalis 22:06
Yeah. Thank you, Eric. That was super interesting, Ryan, to hear all the stuff you had to say about the ideas behind Quine. I would like to start by asking you something about graphs, and I want your opinion on it. We can approach graphs as a way to ask questions, and as a way to represent data, right? You don't necessarily need to have both. You can have, let's say, a graph-based language for querying the data and translate that into, like, a relational model, or whatever is on the backend, or even a key-value store, right? We see quite a few companies doing that, like with GraphQL and having something like Postgres behind it. And then, obviously, you can have graphs as a way to represent the data at a much lower level, in terms of, like, the database system that you have. How do you see that, and which part is important? Probably both, I don't know. Which part is, like, the R&D-minded part of Quine?
Ryan Wright 23:32
Yeah, so that's an interesting question, and there are probably two sides to the answer. I think the preview is: graph applies to both, but in different ways, maybe significantly different ways. In one sense, the way that you ask your question can be thought of as a graph. So whether it's GraphQL or otherwise, there are query languages set up for graph queries. You can use the Cypher query language or the Gremlin query language, and there's an initiative to try to create a standardized graph query language. Ways to express your problem very naturally fit into a graph, because that's really how humans think. A lot of what we think about, and just the way language works, the way our mental model of the world works, is reflected in this node-edge-node pattern, repeated and connected. That same pattern shows up in our language: subject, predicate, object. "Ryan knows Kostas": I'm a node, you're a node, and our relationship is the edge that connects us. That's how you build the social graph: users connected together by who knows whom. And you can frame questions as a graph, using the graph query languages, or even just more naturally using natural language, which is naturally graph-shaped. So on one side, there's a good argument and analogy for graph-structured questions, because they fit the way we talk. On the other hand, when you go to write programs and write software, that's where the software engineers need to turn it into something linear, some code that we can execute. And there are other good reasons to head in a different direction on that question-asking side. So it's not exclusively that, and even then, a lot of graph queries literally turn into something that looks like SQL. Like in the Cypher language, it's very SQL-like. So you're writing out this linear progression of a query.
And it's gonna go query a graph, but you're expressing it in something that has this linear query structure. So the question side kind of comes in and out, and it's maybe more conceptual than it is literal, the graph view on the question-asking side. What we found is that really the secret sauce for Quine, what makes it different and special, is in the graph runtime. I mentioned before that graphs, graph databases especially, have been known to be really slow and lethargic in terms of what they can handle, and that limits their application, so developers don't dive in and use them so often. I think because of that, there's developed this sense in the industry that graphs are slow: if you want it to be fast, you've got to use some other tools; graphs themselves are slow. What we found through years' worth of research is that that's kind of an artifact of the graph database mentality: that data is gonna be primarily stored, it's gonna sit there at rest, and you're gonna bring a query to it and occasionally pull out an answer. That database mentality, static data stored on disk, has been a limiting concept for what graphs can be. And the reason Quine lives in between two worlds, in between that database world and the stream processing world, is because, really, at its heart, we built Quine saying: what if we didn't automatically adopt all the database assumptions? We're gonna have to confront the same fundamental challenges, but what if we started from a different place? Let's take that graph data model, and let's marry it to a graph computational model. For us, that means an old idea from the 1970s: the actor model.
So this asynchronous message-passing system, which has really become the foundation of stream processing for building reactive, resilient systems, means you have these distributable, scalable actor implementations that allow you to send messages as a way of doing processing, and of representing data in that system as well. So Quine builds the graph where nodes get backed by actors under the hood. They're independent processes, and you could have thousands, hundreds of thousands, millions of them running in a system. You can have a lot of them moving, you can distribute them across clusters, and all communication happens through asynchronous message passing. With that as the framework, that much more stream-processing kind of approach lets us incorporate other stream processing considerations, like backpressure, into how queries are executed and how data is processed. So we're building a graph system that isn't automatically adopting the old-school ideas of a database, but is built for the modern world of high-volume streaming data as the first-class citizen we're really trying to work with. And on the backend, what we found is that when you bring that kind of new approach to the runtime model, it can unlock some pretty stunning efficiencies. It gives us the opportunity to do graph computation at just the right moment, the ideal moment, when your data is in memory, it's already there. And then it turns out the graph structure behaves like an index. You've got a node here, and you want to know, well, what data is related to this? What should I warm up in the cache so that it's ready to go for high-volume data processing? Well, a very natural answer to that question is: warm up the nodes that are connected to the one you're talking to, the ones that have an edge connecting to the node in question. Warm up its neighbors.
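The "graph structure behaves like an index" idea can be sketched as a toy Python model: when a node is touched, its neighbors get pre-warmed into memory, since a graph-shaped query is most likely to hop there next. This is only an illustration of the caching idea; Quine's actual actor implementation (on the JVM, with asynchronous message passing and backpressure) is far more involved.

```python
# Toy sketch: nodes as actor-like objects; touching one warms its neighbors.

class Node:
    def __init__(self, node_id):
        self.id = node_id
        self.neighbors = []      # edges to other Node objects
        self.in_memory = False   # whether this "actor" is warmed up

    def receive(self, message, cache):
        # Waking a node also pre-warms its neighbors, because the edges
        # tell us exactly which data a graph query will want next.
        self.in_memory = True
        cache.add(self.id)
        for n in self.neighbors:
            n.in_memory = True
            cache.add(n.id)

a, b, c = Node("a"), Node("b"), Node("c")
a.neighbors = [b]
b.neighbors = [a, c]

cache = set()
a.receive("set foo=bar", cache)
print(sorted(cache))  # ['a', 'b']  ('c' stays cold until a query reaches 'b')
```

The design point is that no global index lookup is needed: the edge list on each node is the index, so warming decisions fall out of the graph structure itself.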
Kostas Pardalis 30:03
That's super interesting. Okay, so if we have, I mean, a key-value store, I think the data model there is pretty clear: you have keys and you have values, right? Okay, things can get a little bit more complicated if you also allow some nested data there, but I think it's pretty clear in, like, everyone's mind how you model something like that, how you create a key-value store, how to design it. Pretty much the same also with relational databases: you have the concept of the schema, you have the table, the table has columns, the columns have a specific type, and then you think in terms of relationships, how one table relates to the other, and all that stuff. Okay. So there is, I think, pretty common knowledge of how someone can design the database to drive an application or, like, do some analytics. How does this work with Quine? Let's say I want to use Quine, and probably I already have some data sources that already have a schema. Some of them might be relational; some of them might be, like, hierarchical, like documents or whatever. What do I have to do to define this graph? Or does Quine figure it out on its own? How does it work? How do I go from all these different data that I have out there to a consistently updated graph with specific semantics that I understand, on Quine?
Ryan Wright 31:48
Yeah. So there's a two-step process to using Quine, and let me kind of preface that by saying where this will fit in the data pipeline. As data is moving through your system, through your data pipeline, Quine plugs into Kafka on one side, and then plugs into Kafka on the other side. So it kind of lives in between two Kafka streams, being really a graph ETL step: take what's in that first stream, combine it, understand it, express what you're looking for, and then stream out what will hopefully be a much lower-volume, but higher-value, set of events coming out of Quine and into the next Kafka topic. So to use Quine is to first answer this question about where's my source of data: plug it into a Kafka topic, or a Kinesis topic, or Pulsar, some streaming system that is going to deliver this infinite stream of events, and aim to stream out meaningful distillations of what's coming through your pipeline. So the detection of a fraud scenario, or an attack graph from a cybersecurity use case, or the root cause analysis of your log processing, or whatever the use case may be. Those sources of data get built into a graph with the first step: defining a query that takes every record and builds it into a small little subgraph. One JSON object comes through that stream; that JSON object has a handful of fields in it, maybe they're nested fields, that's fine. The first step to using Quine is you write a Cypher query that says: I'm going to take in that object, and I'm going to build it into a small little subgraph. It's like a little picture, a paint splatter or something: there's a node in the middle, and maybe a couple things coming off of it. That tends to be the shape that it gets built into, because we can take that JSON, and we could just say, make one node out of it. So take that JSON, make one node. The value in our graph is when you start connecting it into other data.
So we've got that JSON object stored as a node, a disconnected node in the Quine graph. But what do we want to start intersecting it with? Pull off some fields from that JSON object. If there's an IP address in there, pull that off and use it to create an edge to another node, where that other node represents the IP address. That way, if you end up with two JSON events somewhere in your stream, and they both refer to the same IP address, they both get connected to the same IP address node. Maybe the same thing with a URL or a username. You can fit it into a hierarchy of timeframes, so that there's this progression of events that fit into this hierarchical time representation. So that first step is to take that JSON object, pull off fields, and kind of build it into a subgraph, so that you can intersect it with other data.
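The ingest step Ryan describes can be sketched in Python: each JSON record becomes one event node plus an edge to a shared IP node, so two events that mention the same IP address end up connected through the same node. The data structures here are hypothetical stand-ins, not Quine's actual Cypher ingest queries.

```python
import json

nodes, edges = {}, []

def ingest(record_json, event_id):
    """Build one JSON record into a tiny subgraph: event node -> shared IP node."""
    rec = json.loads(record_json)
    nodes[event_id] = rec                         # event node, properties = fields
    ip_id = "ip:" + rec["ip"]                     # pull the IP field off the record
    nodes.setdefault(ip_id, {"addr": rec["ip"]})  # one shared node per IP address
    edges.append((event_id, ip_id))               # edge: event -> its IP

ingest('{"user": "alice", "ip": "10.0.0.1"}', "event1")
ingest('{"user": "bob",   "ip": "10.0.0.1"}', "event2")

# Both events now hang off the same IP node, which is how separate records
# from the stream start intersecting into one graph.
print(edges)  # [('event1', 'ip:10.0.0.1'), ('event2', 'ip:10.0.0.1')]
```

The `setdefault` call is doing the important work: deriving the IP node's identity from its value, so every record mentioning that IP lands on the same node instead of creating a duplicate.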
Kostas Pardalis 35:09
So, if I understand correctly, Quine does not store the data; it processes input data that's coming from a stream and outputs the results to another stream. Is this correct?
Ryan Wright 35:31
Great question. Almost. Quine does store data, but it does so using existing storage tools. There's a persistence layer built into Quine, where you can choose how it's going to store data: locally, in something like RocksDB, on your local network, or kind of in a managed cluster using Cassandra. There's this pluggable interface to choose any of several different storage technologies as Quine is building up the graph. The value comes from intersecting events that happen over time in your stream. It's been an unfortunate limitation in stream processing to say that, well, if you want to join events through an event stream, you'll have to hold them in memory, and so you'll have to set a retention window for how much data you're willing to hold on to. So if you're trying to match A and B, and A arrives first, you have to hold on to it and wait, and keep consuming the stream looking for B. When B finally arrives, you can join it to A. That's great, but what about when you have a C and a D and an E, and you need to join more things together? Your problem gets a lot more complicated, and to hold all of that in memory in order to be able to join them together, you're forced to use expensive machines with huge amounts of RAM, and to set some artificial time window that says: our RAM is limited, we can only hold so much data there, so I'm gonna hold on to A for 30 seconds, a minute, 10 minutes, 30 minutes. But if B doesn't arrive in that timeframe, I just let it go. And if B arrives after that timeframe, I missed it. We just missed that record, we missed that insight of what we were trying to understand, because we had to force this artificial time window, which, for operational reasons, was the state of the art. So Quine tries to solve that by using the storage layer under the hood, storing your data durably in a known fashion with known quantities. You can use robust tools like Cassandra or ScyllaDB.
And you can set TTLs so that you can expire old data, or you can hold on to it in whatever way works best for your application. But you’re joining data over time in that stream, in a way that is robust and durable and can help overcome this unnatural time-window limitation, so that we can find A and B and C and D and E even when they all arrive more than 30 minutes apart, or whatever it is. This becomes really important for some use cases like detecting advanced persistent threats in the cybersecurity world: attackers who are deliberately spreading out their attack over a long period of time, so that they can hide in the high volume of data that’s coming in.
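The retention-window limitation Ryan describes can be sketched in a few lines. This is a deliberately naive in-memory stream join, not Quine’s API; the event shape, function names, and 30-second window are all illustrative assumptions.

```python
# A naive in-memory stream join with a retention window.
# Events are (key, label) pairs; we try to pair an "A" with a later "B".
# All names here are illustrative, not Quine's actual API.

RETENTION_SECONDS = 30

pending = {}  # key -> (label, arrival_time), held in RAM while we wait

def on_event(key, label, now):
    """Return a joined pair if the partner is still retained, else None."""
    # Evict anything older than the retention window: this is the
    # "artificial time window" that loses late-arriving matches.
    for k in [k for k, (_, t) in pending.items() if now - t > RETENTION_SECONDS]:
        del pending[k]

    if key in pending:
        partner_label, _ = pending.pop(key)
        return (partner_label, label)  # joined A with B
    pending[key] = (label, now)
    return None

# A arrives at t=0, B arrives 10s later: the join succeeds.
assert on_event("order-1", "A", now=0) is None
assert on_event("order-1", "B", now=10) == ("A", "B")

# A arrives at t=0, B arrives 60s later: A was evicted, the match is lost.
assert on_event("order-2", "A", now=0) is None
assert on_event("order-2", "B", now=60) is None
```

The second pair is exactly the missed insight Ryan mentions: nothing crashed, the record simply aged out of RAM before its partner arrived. Backing the pending state with durable storage, as Quine does, removes the need for the eviction step entirely.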
Kostas Pardalis 38:47
So just to make sure I understand correctly: Quine, let’s say, calculates and updates a graph in memory, and it persists the state of this graph to a storage layer that can be anything, like RocksDB or Cassandra or something like that. How do you store a graph-like data structure in a storage layer that obviously has not been optimized for that? It’s something different; it can be a key-value store or a relational database. So how do you do that?
Ryan Wright 39:29
Yeah. So one of the interesting things that Quine does, which I have never seen in any other graph system, and in very few other data systems, is that Quine uses a technique called event sourcing to store data. What it actually stores on disk is not the materialized state of every node in the graph; it stores the history of changes. So when a new record streams in from Kafka and we’ve got to go create some structure, the data flows to a point in the graph, a node in the graph handles it and says: I need to set these properties. Well, if those properties are already there, if they’re already set, then it’s a no-op, because we don’t have to store anything on disk. If we’re setting foo equal to bar, but foo already equals bar, there’s no change; there’s nothing to do, nothing to update. It’s only when we’re setting foo equal to baz, and it used to equal bar, that we’ve got an update. That update gets saved to disk in the event-sourcing fashion. That lets us write a very small amount to disk so that we can keep up with a high volume of data streaming through. We just record the changes; many times those changes are duplicates or no-ops, and so we can reduce what we save. And when data is stored, it gets saved in a fashion that resembles a write-ahead log: small little writes that can be done very quickly in a simple structure that looks like a key-value store. So we can take advantage of the high throughput you get with key-value stores like Cassandra and others, to have really high volumes of data moving through, but then build it together into a graph that gives you that expressivity for answering complex problems. And that time dimension is another interesting angle here, too.
Because we save the log of changes, we can actually rewind the entire graph if we need to, and say: here’s a question, I would like to run this query and get an answer for the state right now, but also tell me what the answer would have been 10 minutes ago, or a month ago. Same query, just add a timestamp, and Quine will give you the answer for what it used to be at that historical point.
Kostas Pardalis 42:00
So whenever you restart Quine, do you have to go and recreate the graph? How does this work?
Ryan Wright 42:13
So the graph gets stored durably. That log of changes for every node in the graph is saved to disk, and then periodically, materializations and snapshots get saved as well, because that shrinks what has to be replayed. On demand, when a node is needed for the processing in the stream flowing through, that node goes and wakes itself up. It has actors behind the scenes that can take independent action, so they can go fetch their journal of changes and replay that journal, or restore a snapshot, so that the node is ready to go and can handle incoming messages for whatever time period it’s meant to represent.
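The wake-up path Ryan describes, restore the latest snapshot and then replay only the journal entries recorded after it, can be sketched as a single function. The data shapes here are hypothetical, not Quine’s actor implementation.

```python
# Restore-then-replay on node wake-up: a snapshot bounds how much of the
# journal must be re-applied. Illustrative structures, not Quine internals.

def wake_node(snapshots, journal):
    """snapshots: list of (timestamp, state dict).
    journal: list of (timestamp, key, value) changes.
    Returns the rebuilt state and how many entries were replayed."""
    if snapshots:
        snap_time, state = max(snapshots, key=lambda s: s[0])
        state = dict(state)          # copy, so the snapshot stays untouched
    else:
        snap_time, state = float("-inf"), {}

    replayed = 0
    for t, key, value in journal:
        if t > snap_time:            # only changes after the snapshot
            state[key] = value
            replayed += 1
    return state, replayed

journal = [(1, "foo", "bar"), (3, "foo", "baz"), (5, "hits", 7)]
snapshots = [(3, {"foo": "baz"})]    # snapshot taken after the t=3 change

state, replayed = wake_node(snapshots, journal)
assert state == {"foo": "baz", "hits": 7}
assert replayed == 1                 # only the t=5 entry needed replaying
```

Without the snapshot, all three journal entries would be replayed; with it, only one. That is the "shrink what has to be replayed" benefit in miniature.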
Kostas Pardalis 42:58
That’s awesome. And can you share a little bit of information about the performance and how well Quine scales?
Ryan Wright 43:05
Yeah, so we’ve been trying to find the limit, and we haven’t found it yet. In my experience, and I mentioned this before, I led some DARPA-funded research projects that were aimed at building big graphs to analyze for cybersecurity purposes. What we kept finding was that when we used all the other graph systems out there, if you’re trying to stream data into a graph database, you can run anywhere from maybe 100 up to about 10,000 events per second, max. There have been some iterations on graph databases to get to that 10,000, but there’s this limit at that level, and it gets harder when you add in a combined read-and-write workload, where we’re not just writing the graph, we’re also trying to read out results in real time and publish them downstream. In our experience, we tried every graph system out there. Some of them lose data along the way, which is just not good. Others have this natural limitation on their throughput, that 10,000-ish events per second, depending on the use case. Against that backdrop, we were deploying Quine in an enterprise environment. Our customer came to us and said: our problem begins at 250,000 events per second, so if you can show us something running that fast, we can talk. So we stood up Quine and ran it at 425,000 events per second on a cluster that cost $13 an hour.
Kostas Pardalis 44:51
Oh, wow. That’s impressive.
Ryan Wright 44:55
So since then, we’ve kept going past that, and we’ve done over a million events per second. And that’s a million events ingested per second. What I haven’t gotten to yet, and apologies, this is a long way around to answering your question, is the other half: to get data out of Quine, you set a standing query to monitor the graph, looking for as complex a pattern as you’d like. Three node hops, five, ten, fifty, it doesn’t matter; as large and complex a pattern as you like, with conditions and filters and whatever you need to express. You’re looking for a pattern in that graph, and we’re doing that reading and monitoring at the same time that we have this write-heavy workload. All the numbers I was talking about, getting up to a million events per second and beyond, that’s a million events per second ingested while simultaneously monitoring the graph for patterns and streaming out results. So it turns into the equivalent of another, well, it depends on your pattern, but if it’s a five-node pattern, you’re probably doing something like another 5 million read queries per second. We’re not actually executing those, but it’s the equivalent of that sort of complex querying of your whole dataset, in order to find every change, every update, every new instance of the complex pattern you’re looking for. Those results stream out of Quine using something that we call standing queries. It’s a lot like a database trigger: you just say, here’s what I’m looking for, and every time you find it, here’s the action I want you to take. So go publish it to Kafka, or log it to standard out, or save it to disk, or even use it to call back into the graph and update something else. So that query triggers that output and feeds it to the next system.
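The standing-query idea, register a pattern once and fire an action each time an update completes it, can be sketched on a tiny edge graph. The class, the two-hop predicate, and the callback mechanism are all hypothetical, not Quine’s actual query engine.

```python
# A sketch of a standing query as a trigger: check patterns incrementally
# on each update, instead of re-querying the whole graph.
# Hypothetical API, not Quine's.

class TinyGraph:
    def __init__(self):
        self.edges = set()           # (src, dst) pairs
        self.standing = []           # (predicate, action) pairs

    def standing_query(self, predicate, action):
        self.standing.append((predicate, action))

    def add_edge(self, src, dst):
        self.edges.add((src, dst))
        # Incremental check: only patterns completed by this new edge fire.
        for predicate, action in self.standing:
            for match in predicate(self, src, dst):
                action(match)

def two_hop(graph, src, dst):
    """Yield a->b->c paths newly completed by the edge (src, dst)."""
    for a, b in graph.edges:
        if b == src and (a, b) != (src, dst):
            yield (a, src, dst)      # new edge extends an existing edge
        if a == dst and (a, b) != (src, dst):
            yield (src, dst, b)      # new edge is extended by an existing edge

found = []
g = TinyGraph()
g.standing_query(two_hop, found.append)   # "here's the action I want you to take"

g.add_edge("login", "download")           # no two-hop path yet, nothing fires
g.add_edge("download", "exfiltrate")      # completes login->download->exfiltrate

assert found == [("login", "download", "exfiltrate")]
```

The `action` callback is the pluggable part: in place of `found.append`, it could publish to Kafka, log to standard out, or write back into the graph, mirroring the options Ryan lists.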
Kostas Pardalis 47:02
And in terms of query concurrency, how many standing queries can someone run? And how many have you seen deployed out there with your customers? How much can you really look into this graph with queries, in the end?
Ryan Wright 47:23
We’ve deployed with hundreds of standing queries so far. And I know this is something our team has been working on recently, because there’s some low-hanging fruit to carry that number up a lot higher. This is one of the reasons for open-sourcing this project: to show off the interesting work that’s been done here, and also to get input and insight from any community members who want to get their hands dirty, look at it, and say, hey, here’s a little tweak that could improve the efficiency over here.
Kostas Pardalis 47:58
Ryan Wright 49:09
Kostas Pardalis 51:08
Awesome. Looking forward to that. All right, I think I’ve really monopolized the conversation here, so Eric, all yours.
Eric Dodds 51:16
Yeah, I think that’s just how the show works. I monopolize, you monopolize, and then the show’s over. Okay, Ryan, let’s talk about, well, what’s interesting is, I’m thinking about the team that would operationalize Quine. Potentially it involves a lot of different players from different sides of the technical table. What does that team usually look like? Because you’re deploying something, you’re connecting it to existing pipeline infrastructure, streaming infrastructure, you’re setting up a key-value store that manages the log, and you’re querying it, right? So can you just explain the ecosystem of the team that would both implement and then operate Quine on an ongoing basis?
Ryan Wright 52:22
Yeah, sure. So what we’ve usually seen is that there are really three roles, and sometimes these three roles are occupied by a single person, sometimes they’re each occupied by separate teams, depending on the scale. One is the architect, the person responsible for the big picture: how do these things relate, how does it all flow together, how does the system scale, how does it connect into the rest of the data pipeline? So the architect is working at that level, to see the structure that Quine fits into. Next is the subject matter expert, the person who says: here’s my Kafka topic or my Kinesis topic, it’s got data in it, I know what that data means, and I want to turn it into this kind of answer. That’s the person who is writing the Cypher queries, who understands how data gets built into a graph and how that graph gets monitored with standing queries to turn it into a stream of answers coming out. And then the third role is really the operator: the operations team or the data engineering team who is responsible for standing up the system and keeping it running over the long run. So those are the three different roles. As I mentioned, sometimes they’re occupied by one person wearing three different hats, but a lot of times, especially at larger companies working at scale, they tend to be separate concerns occupied by separate people. [Eric: Yep, makes total sense. And then, oh, sorry, go ahead.] The enterprise version of Quine is something that the company behind Quine charges for, and it’s aimed at addressing the concerns that those teams have in a real high-volume enterprise situation. So you want to run a cluster, and you want that cluster to be scalable and resilient, because when you’re running a cluster, members are going to die. That’s just how it’s going to happen.
You need to have that system running in a way that is resilient to failure, can be scaled over time, and can coordinate and scale with the data storage layer behind the scenes. So that’s really where the commercial work for thatDot tends to focus.
Eric Dodds 54:50
Got it. Very helpful. And then one last question here, because we’re getting close to time. We talked a lot about why graph, or why Quine, and what problems it solves, and you talked through a couple of examples. Let’s talk a little bit about when. Some of the examples you mentioned, and maybe especially some of the scale requirements, like 250,000 events per second, sound like very enterprise-level problems, right? High-scale, enterprise-level problems, which makes sense. This is a pattern that we’ve seen over and over again: emerging open-source technology will be built to solve these large-scale problems, where, okay, graph had a limit of 10,000 events per second, so it really wasn’t a viable solution for these high-scale systems, and so new technology emerges to solve a problem that existing technology couldn’t. Do you see that trickling down? Is there a use case for smaller companies that aren’t at large scale? So maybe the question is not necessarily “why graph,” because that makes sense, but maybe “when graph?”
Ryan Wright 56:12
Yeah, great question. We’ve started seeing this, and I think this is part of the reason there’s so much buzz in the open-source community: even at a small scale, we’ve seen users taking Quine and using it as part of their data pipeline to put some of the expensive interpretation steps ahead of their expensive tools. So you’re streaming data, and it’s headed to Splunk. Maybe you don’t have a million events per second going into Splunk, or else you’d be spending billions of dollars a month on your Splunk bill, but you’ve got data flowing through, and downstream it’s going to go through some expensive analysis. Maybe it’s some complex machine learning that has to happen, maybe it’s being loaded into Splunk or some other expensive tool downstream. We’ve seen users put Quine into their pipeline to do some of the early processing that takes some of those expensive downstream analysis steps out of the budget, basically: do it upstream so that you don’t have to pay for it downstream. And a lot of times, those aren’t millions-of-events-per-second use cases; they’re more reasonable scales, in the hundreds or thousands or tens of thousands, that let you put the pieces together, understand what you’re working with, reduce your data, and have the reduced amount of data be more meaningful, so that you can more effectively, or more cheaply and efficiently, do the downstream analysis that you need.
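The "reduce upstream so you don’t pay downstream" pattern Ryan describes can be sketched as a tiny stream reducer: collapse a high-volume raw stream into a much smaller stream of meaningful summaries before it reaches an expensive tool. The event shape, the failed-login rule, and the threshold are all invented for illustration.

```python
# Collapse raw events into summaries before an expensive downstream tool.
# Illustrative only: the rule and thresholds are made up for the sketch.

from collections import defaultdict

def reduce_stream(events, min_failures=3):
    """events: (user, outcome) pairs. Emit one summary per suspicious user
    instead of forwarding every raw event downstream."""
    failures = defaultdict(int)
    out = []
    for user, outcome in events:
        if outcome == "login_failed":
            failures[user] += 1
            if failures[user] == min_failures:  # emit once, at the threshold
                out.append({"user": user, "alert": "repeated_failures"})
    return out

raw = [("alice", "login_failed")] * 5 + [("bob", "login_ok")] * 100
summaries = reduce_stream(raw)

assert summaries == [{"user": "alice", "alert": "repeated_failures"}]
assert len(raw) == 105 and len(summaries) == 1  # high volume in, high value out
```

105 raw events become one meaningful record, which is the cost argument in miniature: the expensive downstream system only ever sees the summary.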
Eric Dodds 57:58
Yeah, super interesting. And I know there are several factors there; it’s almost like a first-pass filter, like a pre-compute before it gets to the expensive compute layer. It sounds like there are two sides to that, right? You said you were running 425,000 events per second at 13 bucks an hour, so there’s an infrastructure, architectural advantage: because of where the system sits in the stack and the way it’s architected, it sounds like you’re inherently realizing low costs at scale. But on the other side, that’s due to the nature of graphs, right? You’re really leveraging the power of graph as the pre-compute, and it just so happens that the architecture runs very cheaply. Is that the best way to think about it?
Ryan Wright 59:00
Yep. Yep, completely agree. And that step to assemble your data, as we described it: it’s one record at a time, but you spread it out for the sake of connecting it. And once it’s connected, you can see all these meaningful patterns, the kinds of things that we just naturally talk about, that get built over time. That’s why a lot of times we see the total volume of data go down, because you can have a lot of things come in, and a lot of partial patterns get built, but they’re not necessarily what you need to analyze. So we like to say: high volume in, high value out. The goal is to reduce how much data comes out, but have it be more meaningful, more understandable, more important.
Eric Dodds 59:52
Yep. Makes total sense. All right. Well, we’re at time here. But before we hop off, if our listeners want to learn more about Quine or thatDot, where should they go?
Ryan Wright 1:00:05
Quine.io is the home of the open-source project. There’s documentation on there, and getting-started tutorials, and you can download the executable. It’s packaged up as a jar or a Docker image, a handful of different formats. So go check out Quine, pull it down, and it’s free. See if it can help in your data pipeline. If it helps solve an important problem for you, or if you have questions about it and you want to share your questions or share your story, there’s a Slack community linked from Quine.io that you can join to ask questions or share stories about what’s going on. And if it helps solve an important problem and you want some help scaling it, or commercial support for it, thatDot.com is the company behind it: the creators of Quine, and a team of excellent engineers and others who are working toward supporting it.
Eric Dodds 1:01:03
Awesome. Well, thank you so much for the time, Ryan, I learned a ton. And it was such a great conversation. So thank you.
Ryan Wright 1:01:09
It’s a pleasure to be here. Kostas, Eric, thank you very much.
Eric Dodds 1:01:13
That was a fascinating conversation. Kostas, I don’t know if I have one major takeaway. The volume, a million events per second, is pretty wild, so that was certainly a takeaway. But the larger takeaway, and we didn’t necessarily discuss this explicitly, but what I’ll be thinking about is, we’ve talked about batch versus streaming on the show quite a bit. What strikes me about Quine is that it is fully adopting streaming architectures, and it’s assuming that the future will operate primarily on a streaming architecture, which is pretty interesting. And that architectural decision for Quine, I think, says a lot about the way the people who created it see the future of data.
Kostas Pardalis 1:02:08
Yeah, I think it’s also a natural fit because of the kind of stuff you’re doing with graphs, because graphs naturally arise in systems where you collect events, and events in general tend to be streaming in nature. So I think it makes a lot of sense. What I found extremely interesting in the conversation with Ryan is that, it feels to me at least, they have managed with Quine to figure out exactly the right type of problem that makes sense to solve with both streaming and graph processing, and to do it in a way that’s very complementary: almost like adding a streaming graph layer on top of whatever you already have, instead of having yet another monolithic database system that does everything around graphs. That’s what I found extremely interesting, and I’m obviously very curious to see where Quine will get. This conversation made me really excited about the progress of this project, especially because of the product decisions they’ve made.
Eric Dodds 1:03:43
Sure. It’s very cool. All right. Well, thanks for joining us on the show. Subscribe if you haven’t, tell a friend and we will catch you on the next one.
We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.