Episode 38:

Graph Databases & Data Governance with David Allen of Neo4j

June 2, 2021

In this week’s episode of The Data Stack Show, Eric and Kostas talk with David Allen, a partner solution architect at Neo4j. Together they discuss writing technical books, integrating something like Neo4j with an existing data stack, and many different use cases for graph databases.

Notes:

Highlights from this week’s episode include:

David’s background in comparative databases (1:50)
David’s experience and lessons he learned from writing his book (3:23)
How writing a technical book compares to writing technical documentation (4:41)
The process of writing a book (6:30)
The best and worst part of David’s book writing experience (8:02)
An introduction to what Neo4j is (9:08)
Why you need graph (11:13)
Typical problems a graph database is a good solution for (13:00)
How graph and relational databases differ in performance and ergonomics (18:41)
How Neo4j addresses performance and ergonomics (23:30)
Neo4j and scalability (26:20)
How Neo4j fits in the modern data stack (31:48)
Neo4j use cases (35:45)
Practical implementation of Neo4j (40:51)
Neo4j’s relationship with open source (45:50)

The Data Stack Show is a weekly podcast powered by RudderStack. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 0:06

The Data Stack Show is brought to you by RudderStack, the complete customer data pipeline solution. Thanks for joining the show today.

Today on the show, we have David Allen. He works for a company called Neo4j and they are a graph database. This is probably going to be a pretty technical discussion, which we really like. I’m interested to ask about how graph can be used in the context of event data. That’s something that I deal with every day in my job and we see that there’s the clickstream thing. Graph seems like a really interesting application for that, so that’s my burning question. Kostas?

Kostas Pardalis 0:53

I want to focus on how Neo4j and, in general, graph databases fit into a modern data stack. I’m quite familiar with the product. I’ve seen how it grew from the early days of Neo4j as an open-source project. It has matured loads and, of course, the market has changed a lot, so I really want to see what kind of use cases exist out there today for graph databases.

Eric Dodds 1:17

Awesome. Let’s jump in and talk graph with David.

David, welcome to The Data Stack Show. We’re really excited to talk with you about all sorts of things, but especially graph.

David Allen 1:30

Thanks for having me today. I’m looking forward to the conversation.

Eric Dodds 1:32

We love getting into the technical details but, as I’m known to do, I want to start with hearing your background and then I have a question about something non-technical, so just give us the two-minute overview of who you are, what you’re doing, and how you ended up where you are.

David Allen 1:50

I started off my career in data management consulting. In that capacity, I was usually working with corporate customers doing things like building ETL pipelines, helping them with data quality problems, helping them with Master Data Management and governance type stuff. After a couple of jobs in that area, I ended up doing some applied research and development for government at a company called MITRE. It was actually when I was at MITRE that I ran into graph for the first time. I didn’t actually go straight to Neo4j after MITRE. I went on and did some work as a CTO for a startup and did a couple of other jobs and ended up coming back to Neo4j when I found a position opened up and I wanted to re-engage in the graph space. I think of myself as coming from a background of comparative databases and, at one point or another in my career, I seem to have ended up having a situation where I needed to use all of them.

Eric Dodds 2:47

Very cool. I want to dig into graph stuff because there’s so much there, but you’re actually an author. You wrote a book and you have spent some time in the academic space, so I’m really interested to know—on the authorship side of things—what was that experience like? Were there any lessons from data management or the engineering side of things that you took into the process of writing a book? Now, of course, it was a technical book, but that’s cool that you’re an author and work in the technical side of data.

David Allen 3:23

It’s been a while since that book was published. When you start the process of writing a book, you definitely bring all of your technical experience in and you set out to try to summarize some of what you’ve learned in the context of technical publications, whether it’s a book or a blog post, any form of publication like that. In terms of the lessons learned, wow, there were a lot of them for me in that process. One of them is don’t get into technical book publishing if you’re trying to make money, that’s for sure. You set out with a clear outline and you’re like, “Okay, I know what I want to say in this book.” In developing the connective tissue and making the entire story cohesive, you find that you end up having to go do some more basic research to fill in the gaps (areas where you thought you knew something but you didn’t actually) to make the whole thing hang together. When I started the process, I thought it was going to be “Okay, we’re going to sit down and write down everything we’ve learned.” It wasn’t that exactly. We had to fill in a lot of gaps.

Eric Dodds 4:28

Interesting. How would you compare the experience of writing a technical book to maybe something more like writing technical documentation?

David Allen 4:41

I’ve written a lot of technical documentation over time. It’s definitely much narrower in scope and it has a more focused audience whereas, in a longer-form, book type of setup, you’re expected to cover the whole spectrum. There’s the understanding-oriented material: the key concepts and the theory behind them. Then there’s also the “how-to” oriented material where you say, “Okay, this line of code, and then that line of code.” Then there’s the tutorial element, which is how to apply your knowledge to a novel situation. When you’re writing technical documentation, I find it usually falls into one of those categories: explanatory, conceptual, how-to, or tutorial. When you’re in the longer form (like if you were writing a book), you end up having to figure out the proper balance of all of that to give somebody a comprehensive view of a subject.

Eric Dodds 5:42

Kind of like a narrative arc, almost.

David Allen 5:45

Yeah, there you go, like how colleges structure course materials. There’s this 101 progression up to the higher-level course material. You have to think about that knowledge path when you’re writing a longer piece of technical information because you want to jump in and explain the complicated stuff but you have to lay the groundwork and get some conceptual machinery out of the way so the more complicated stuff will make sense later on. In the context of a tutorial or a blog post, you never do that. You just say upfront “you need to know these three things before you read this” and then that’s that.

Kostas Pardalis 6:22

That’s super cool. Can you tell us a little bit more about how it feels to write a book? What is the process? How long does it take?

David Allen 6:30

I’ve only done this once. I wouldn’t represent myself as an expert in the area, but it’s a pretty grinding process. You first start with a proposed outline before your project is even accepted, and you typically work with an editor who provides feedback on that outline. Then you get to something like an annotated outline: think of it as a fleshed-out outline with a list of bullets under each subheading for what you would talk about or how you would approach it. Then you go through the grinding process of developing the first draft of each of the sections. Then, typically, there are expert reviewers: the publisher gets other people in your field to read the book and provide notes so that the publisher knows they’re not putting out bad information about this or that topic. When you get to the point where you’ve got a relatively mature draft, there are a lot of edits that go into that, and then there might be post-production questions like, “Is the publisher going to work with a company that’s going to build an index?” and those sorts of things.

Kostas Pardalis 7:30

That’s super interesting.

David Allen 7:32

It’s a long process.

Kostas Pardalis 7:37

It sounds like it is. The reason I’m asking is I have no idea how a book is written. I’m pretty sure many people don’t know how much work is put in behind all these books that we see out there. It’s a very good opportunity. The last question about the book, I promise. What was the best part and what was the worst part of writing this book for you?

David Allen 8:02

The worst part was the endless revisions and not being sure what the finish line is. I think all book authors go through that at one point or another. Not to dwell on that too much, I’m in the tech industry in the first place because I enjoy learning and I enjoy the process of learning, so I would say discovering the gaps in my knowledge was the most fun part. I set out to write down what I thought I knew and, as I started getting deeper into it, I started to discover, “Okay, the way you’re thinking about this or that topic is a little fuzzy and needs some more details so now I have to go do basic research.” It is as much a learning project as it is a writing project, and that’s actually fun to me. I like that software and technologies put me in a position to keep learning new stuff.

Kostas Pardalis 8:44

That’s an amazing point. Nice. Really, really nice. David, let’s start talking a little bit about Neo4j and graph databases. Could you give us a quick introduction to what Neo4j is?

David Allen 9:08

Neo4j is what we call a native graph database. Graphs are a data structure that folks may or may not be familiar with. Basically, every time you go to a whiteboard and you draw a bunch of circles and lines connecting the circles on the whiteboard, you are describing a graph. Graphs are composed of nodes, which are those circles that you draw, and relationships that link the nodes together. What people find, particularly in the whiteboarding context, is that this structure is very rich, very easy to work with, and very associative and more similar to how the inside of your own head operates than, for example, something like a table or a document. When you’re working with tables, you make a list of records, and so you can think of every table at the end of the day as being a list of sorts. Graphs are just this flexible, open-ended data structure with a lot of links back to basic computer science that you can use to represent any form of data under the sun. When people most often get exposed to graphs for the first time, it’s usually in the context of something like a social network. Imagine I draw my account as a circle on a whiteboard. Then I might draw all the accounts that I follow as other circles, and then draw a line from me to them saying, “I follow those accounts,” and so forth. If you blow that out and you think about the entire Twitter user base and those follower relationships, that is most naturally represented as a graph. You go to LinkedIn and you say, “who’s connected to who,” it’s the same structure again. Facebook, same structure again. That’s how people usually first come to it. It’s in the context of other business problems where we introduce how to apply that kind of graph thinking to lots of other kinds of data.

Kostas Pardalis 11:02

Why do we need graph? Are the relational databases we have so far not expressive enough for the types of problems that we usually solve with graph databases?

David Allen 11:13

It’s not a question of expressiveness. I’ve used a lot of different databases. There are these different data models: graph, relational, JSON documents, key-value stores. It would be silly to say that there’s some kind of fact about the world that you can say with graphs that you can’t say with tables because you can represent literally anything with tables. I find it better to think about different kinds of databases more like tools in a toolbox. The question is not whether it’s more representationally powerful. The question is whether it’s the right tool for the job. If you have, let’s say, a million customers in a table and you want to know all the customers whose zip code is 23229, that’s not a particularly graphy problem. However, if you wanted to, for example, calculate somebody’s Kevin Bacon score (like how many hops away they are from Kevin Bacon in terms of the movies they’ve made), that requires you to navigate a complex set of relationships between records. That is a very graphy problem. The way I try to sum this up in the simplest possible way is to say that sometimes the relationships between the data items matter more than the data items themselves. If that sounds like it fits your problem, you’re probably in the graph space.
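
As a rough illustration of the Kevin Bacon example, the hop count can be computed with a breadth-first search over a co-star graph. This is a minimal Python sketch, not how Neo4j does it; the movie data and the function names (`build_costar_graph`, `hops_between`) are hypothetical and invented for this example.

```python
from collections import deque

# Hypothetical toy data: movie title -> cast list (not real filmographies).
MOVIES = {
    "Movie A": ["Kevin Bacon", "Alice"],
    "Movie B": ["Alice", "Bob"],
    "Movie C": ["Bob", "Carol"],
    "Movie D": ["Dave"],
}


def build_costar_graph(movies):
    """Link every pair of actors who appear in the same movie."""
    graph = {}
    for cast in movies.values():
        for actor in cast:
            graph.setdefault(actor, set()).update(a for a in cast if a != actor)
    return graph


def hops_between(graph, start, target):
    """Breadth-first search: number of hops from start to target, or None."""
    if start not in graph or target not in graph:
        return None
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        actor, hops = queue.popleft()
        if actor == target:
            return hops
        for costar in graph[actor]:
            if costar not in seen:
                seen.add(costar)
                queue.append((costar, hops + 1))
    return None  # no chain of shared movies connects the two actors


graph = build_costar_graph(MOVIES)
print(hops_between(graph, "Carol", "Kevin Bacon"))
```

The zip code lookup, by contrast, needs no traversal at all: it is a single filter over one list of records.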

Kostas Pardalis 12:37

That’s a very good way to put it. I really like how you describe the difference there. You said people usually get introduced to graphs by the social graph. That’s what everyone has heard about. Can you give us a few examples of typical problems that a graph database is a good solution for?

David Allen 13:00

Yeah, sure. Let me tell you the story about how I started using graphs because this is the one that’s most direct to me. I was working for a government research and development company called MITRE, and we were developing a solution for data provenance. We had these senior executives from the government come along and say, “I got this intelligence report” or “I got this estimate” or “I got this financial number, and I don’t know whether I can trust it or not because we have internal process weaknesses in our organization.” They wanted to know how we arrived at this judgment or decision. In order to know whether the information was good or not, we had to trace back to figure out how it was put together. We might say, “We gathered some facts from Bob. Then Bob summarized them in this report, and then that report was processed by that system,” and so forth. Basically, I want you to picture a family tree for any given report you might see. That family tree is a graph because we can say data came from sources, was transformed in certain ways, and then got summarized into the report. The way I first came to graph is that I wanted to build directed acyclic graphs of data provenance so I could answer these questions for these executives. I first did it on top of MySQL using two tables and joins between the tables (which MySQL is perfectly capable of doing) and discovered Neo4j later when I found that it was much easier to develop with and much faster for my purpose. That little story, that’s data lineage, which is fundamentally a graph. We have a lot of banks and financial institutions that use it for fraud analysis. A question might arise, “Is this particular payment or bank transfer fraudulent?” Sometimes that’s very difficult to answer with the isolated details of the payment. Whether it’s a large payment or a small payment doesn’t really mean that it’s fraudulent or not fraudulent, but if you can connect it with the wider community, you can say, “This payment is one of 10 payments that are all going into one bank account which has been accumulating a suspicious amount of funds.” The pattern of relationships would give you a stronger basis to cast suspicion on that one transaction, for example. A lot of other companies use it for recommendation engines. We have some retailers, for example. You go onto their website and add something to your shopping cart, and they would like to recommend a product you might like. One of the ways you can do that with graph is to do a social type of recommendation and say, “You have added an item to your cart that a lot of other users liked so we’re going to recommend the other items they enjoyed” or “we’re going to recommend items purchased by users whose behavior is similar to yours.” When you think about how to express those questions, it’s all about the relationships between users, products, order baskets, and behavior over time. Those tend to be naturally graphy problems where some of the techniques that Neo4j has make your life easier.
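
The data lineage story can be pictured as a small directed acyclic graph that is walked backwards from a report to its sources. The following Python sketch is only an illustration under assumed data: the artifact names and the helpers `trace_lineage` and `original_sources` are hypothetical, not part of the system David describes.

```python
# Hypothetical lineage edges: each artifact maps to the inputs it was derived from.
DERIVED_FROM = {
    "executive_report": ["summary_report"],
    "summary_report": ["bob_notes", "system_x_output"],
    "system_x_output": ["raw_feed"],
    "bob_notes": [],
    "raw_feed": [],
}


def trace_lineage(artifact, derived_from):
    """Return every upstream artifact reachable from `artifact` (depth-first)."""
    upstream = []
    seen = set()
    stack = list(derived_from.get(artifact, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # an input shared by two branches is reported once
        seen.add(current)
        upstream.append(current)
        stack.extend(derived_from.get(current, []))
    return upstream


def original_sources(artifact, derived_from):
    """Upstream artifacts that have no inputs of their own."""
    return sorted(
        a for a in trace_lineage(artifact, derived_from) if not derived_from.get(a)
    )


print(original_sources("executive_report", DERIVED_FROM))
```

Answering "where did this number come from?" is then a traversal question rather than a lookup, which is the property that pushes the problem toward a graph.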

Kostas Pardalis 16:15

That’s super cool. Do you think it makes sense to say that a graph database is a good tool for any problem that cares more about the relationships and directions between entities, instead of the entities themselves?

David Allen 16:31

That’s definitely a part of it. If you care more about the relationships than the items themselves, that’s probably a good tip-off. We just talked about this in terms of connected data, and that’s a pretty wide space, but folks know these problems when they see them inside their organizations. There are other cases where we’ve done a lot of work, like Master Data Management, where you might need to connect the metadata from 15 different databases and ask questions about where customer IDs are split across lots and lots of different databases. That is also a connected data kind of problem, so these kinds of connected problems run the gamut. I don’t usually say that the graph database is the optimal solution for every single problem under the sun. But the expanse of connected data problems is wider than most people appreciate when they first see it.

Kostas Pardalis 17:31

It seems from what you’ve shared so far with us that there are a couple of problems that fall under the category of data governance. In general, graph databases are a very good fit for that, which is very interesting. I wasn’t aware at all that these kinds of problems are solved with graph databases. That’s super, super interesting. We hear a lot about data governance on this show. It’s a very hot space. Right now, there are many products popping up out there solving very specific problems around data governance. At least, I knew in the past that data lineage is a pretty tough problem to solve, so I’d like to ask you a little bit more about that. There is also a selfish reason behind that because I’m very interested in it. You said that you used a graph database to do that and that you previously tried to do it using MySQL. My first question is, what was the difference between the two, in terms of performance and also in terms of the ergonomics of using one system versus the other to solve the same problem? How would you describe this?

David Allen 18:41

This kind of goes back to what we were talking about a little while ago about the difference in representational strength and how there isn’t one. The way that you model a graph in a relational database, we’ve seen 100 different customers do this 100 different ways, but the patterns are all very similar. Basically, what you do is you create a node table, like imagine that we have a person table, and then you create a separate many-to-many join table. Let’s call the table “person knows person,” so then, if I want to query the database to see if David knows Kostas, what I do is join the person table to the join table and back to the person table, and if that crosswalk exists, then that relationship in your graph exists. That’s a perfectly fine way of doing a graph. Other times, if all people need is a hierarchy, they will put, for example, a parent ID in a column and they’ll say, “The parent organization is ID 15,” a foreign key that links back to the same table. That could be a form of a graph link. If you want to then traverse the graph, let’s say you want to navigate from one node to another node in this graph, you’re always going to be doing that by SQL joins, and that’s okay, but here we’ve already planted the seeds of where your performance problems are going to be. The issue with SQL joins is that they need to be re-computed each time. Relational databases have gotten really good at this, don’t get me wrong. They’ve got 30 years of research that goes into projecting out just the right rows very quickly and optimizing that join process. But the unavoidable effect is that you’re typically going to repeat this every time. This in turn means that if you want to do lots and lots and lots of joins, I’m talking about 8, 10, 15, or more, you’re going to be multiplying that computation burden. So from a performance perspective, you can see that navigating the graph via joins is going to scale poorly, no matter how you set your relational database up. From an ergonomic perspective, SQL isn’t meant to do this. If you’re expressing a relationship traversal as a join, you already have an ergonomic gap there. It is extremely difficult to express things like, “I want to know the people that I am connected to that are between two and five hops away from me.” In other words, when you want to place constraints on the length of the path that you’re navigating. I tried to do this when I wrote a custom stack of Java software on top of MySQL for my provenance database. I found that it is possible, but I kind of had to go up the mountain and consult with the local SQL gurus. They gave me a set of SQL constructs that I could use for recursive table joining and even bounded recursion to do those sorts of things. But I’m telling you that the SQL was crazy complicated. When I finally managed to write it, it performed poorly and I could see that that was going to get worse as my traversals got deeper, and as the total volume of data that I was dealing with got larger. That’s the point in my journey with MySQL for storing graphs where I had to say, “I’ve got to take a step back. What am I trying to do here? I’m trying to implement a graph abstraction on top of something that’s not a graph, so are there any options for me out there that will actually store a graph as a data structure?” That’s when I found Neo4j. Sometimes, when we’re talking with customers, we show them this slide where it’s a very graphy query. The slide is something like “find all people that this manager manages up to three levels down and count the number of people per manager,” and you show a four-line Cypher query, and then you show some gigantic, awful SQL query that does the same thing. We use that as a jumping-off point to describe these ergonomic differences between the query languages but really, if you boil out all the detail, what it really gets down to is using the right tool for the job. It’s easier to use a graph query language on top of a graph structure than to use a table query language on top of a table abstraction of a graph.
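
To make the relational pattern concrete, here is a minimal sketch of a node table plus a many-to-many join table, with a direct "knows" check and a bounded "between N and M hops" query written as a recursive common table expression. It uses Python's built-in `sqlite3` module purely so the example is self-contained; David's original system used MySQL and Java, and the table names, sample rows, and helper functions here are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE person_knows_person (
    from_id INTEGER NOT NULL REFERENCES person(id),
    to_id   INTEGER NOT NULL REFERENCES person(id)
);
INSERT INTO person VALUES (1,'David'),(2,'Kostas'),(3,'Eric'),(4,'Ana'),(5,'Bo');
-- Hypothetical chain: David -> Kostas -> Eric -> Ana -> Bo
INSERT INTO person_knows_person VALUES (1,2),(2,3),(3,4),(4,5);
""")


def knows(conn, a, b):
    """One hop: person JOIN join-table JOIN person."""
    row = conn.execute(
        """
        SELECT 1
        FROM person p1
        JOIN person_knows_person k ON k.from_id = p1.id
        JOIN person p2 ON p2.id = k.to_id
        WHERE p1.name = ? AND p2.name = ?
        """,
        (a, b),
    ).fetchone()
    return row is not None


def within_hops(conn, name, min_hops, max_hops):
    """People reachable from `name` in min_hops..max_hops hops (bounded recursion)."""
    return conn.execute(
        """
        WITH RECURSIVE reach(id, hops) AS (
            SELECT id, 0 FROM person WHERE name = ?
            UNION
            SELECT k.to_id, reach.hops + 1
            FROM reach
            JOIN person_knows_person k ON k.from_id = reach.id
            WHERE reach.hops < ?          -- stop expanding at max_hops
        )
        SELECT p.name, MIN(reach.hops) AS distance
        FROM reach
        JOIN person p ON p.id = reach.id
        GROUP BY p.name
        HAVING MIN(reach.hops) >= ?       -- drop anything closer than min_hops
        ORDER BY distance, p.name
        """,
        (name, max_hops, min_hops),
    ).fetchall()


print(knows(conn, "David", "Kostas"))
print(within_hops(conn, "David", 2, 3))
```

For comparison, Cypher expresses the same hop bound directly in the pattern, along the lines of `MATCH (me:Person {name: 'David'})-[:KNOWS*2..5]->(other) RETURN other` (the `Person` label and `KNOWS` relationship type are assumed names).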

Kostas Pardalis 23:09

That makes total sense. That’s exactly what I was hoping to hear from you. I had SQL in my mind when I was asking this question. How does Neo4j address performance and ergonomics? What is Neo4j doing differently?

David Allen 23:30

Let me take those individually. On the performance side, you’ll see Neo4j talk about itself as a native graph database. What that really means is that the data structures, all the way down to what gets written to disk and how it is stored, are optimized for a graph structure. Some folks may be familiar with the idea of sparse matrices, or how you can represent a graph as a matrix. The way Neo4j does it is basically that nodes and relationships are always fixed-length records. Relationships are effectively not much more complicated than two pointers: a pointer to the originating node and a pointer to the terminating node of the relationship. Now, Neo4j likes to have most of the graph structure live in memory where we can. The fixed-length record aspect means that, from a performance perspective, you can jump to the right node just by doing some pointer arithmetic with a memory offset. And the fact that relationships are pointers means that traversing them is literally just dereferencing a pointer. So the way a native graph database works in memory is the combination of those techniques loaded hot in RAM, so that graph traversal, when you strip away all the fancy aspects of it, becomes pointer chasing. That’s a very fast operation to do in main memory. In terms of ergonomics, Neo4j has the Cypher query language. For people who are not familiar with it, I usually just describe it as SQL for graphs. It has SQL-inspired syntax. Things like WHERE, SKIP, and LIMIT all operate the same in Neo4j as they do in SQL, but Cypher focuses on letting you describe the graph pattern that you’re trying to match and then letting the database figure out how to go get it. In terms of ergonomics: as a developer, I draw a distinction between declarative languages and imperative languages. Broadly, with declarative languages you tell the database what you want and it’s the database’s problem to go figure out how to do that. Imperative languages are more like a traversal, where you give the database explicit instructions and say, “Go to this node. Now expand out to that node. Now expand out to that node,” and so forth. Cypher is a declarative language and that’s part of why it has better ergonomics for graph: you just describe the pattern that you want, you use SQL-like syntax, and you let the database work it out.
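
The fixed-length record idea can be sketched in a few lines: if every relationship record has the same size, the record for a given ID sits at `id * record_size`, so a lookup is plain offset arithmetic rather than an index search. The layout below is a deliberately simplified illustration and is NOT Neo4j's actual store format; `REL_FORMAT` and `RelationshipStore` are hypothetical names.

```python
import struct

# Illustrative layout only, not Neo4j's real on-disk format.
# Each relationship record holds two unsigned 32-bit ints:
# the ID of the start node and the ID of the end node.
REL_FORMAT = "<II"
REL_SIZE = struct.calcsize(REL_FORMAT)  # fixed record length in bytes


class RelationshipStore:
    """Append-only store of fixed-length relationship records."""

    def __init__(self):
        self._buf = bytearray()

    def add(self, start_node, end_node):
        """Append a record and return its ID (its position in the store)."""
        rel_id = len(self._buf) // REL_SIZE
        self._buf += struct.pack(REL_FORMAT, start_node, end_node)
        return rel_id

    def get(self, rel_id):
        """Jump straight to the record: offset = rel_id * REL_SIZE."""
        offset = rel_id * REL_SIZE
        return struct.unpack_from(REL_FORMAT, self._buf, offset)


store = RelationshipStore()
first = store.add(0, 1)   # node 0 -> node 1
second = store.add(1, 2)  # node 1 -> node 2
print(store.get(second))
```

The cost of `get` does not depend on how many records the store holds, which is the property the pointer-chasing description relies on.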

Kostas Pardalis 26:04

Sounds great. You said that Neo4j prefers to operate in memory. How does this translate into scalability? How well does the system scale?

David Allen 26:20

As somebody who’s been in engineering for a while, I dislike the scalability question because I usually want to break it up into lots of different kinds of scalability. You can scale storage, you can scale compute, you can scale high-availability attributes, and so on. Neo4j does like to live in memory, so usually what we advise customers is to have memory equal to some healthy percentage of the total size of their database. We also have an arrangement called Fabric that’s used for distributed databases. If you can’t fit your entire graph in memory, what you can do is partition your graph out, and you can have many different database management systems that store pieces or shards of that graph so that you’re not restricted to how much RAM you can get in one box.

Kostas Pardalis 27:07

David, your point about scaling was great. From all these different scalability problems, which was the most interesting and challenging one for Neo4j to solve from an engineering perspective? I ask this question having graph partitioning in mind, in particular, because I always thought that making a graph database distributed is kind of a hard problem. Can you share a little bit more information about that?

David Allen 27:34

Oh, it is a hard problem. Let’s think about how best to approach this. Usually, the hardest part about scalability for me is that you’re usually trying to optimize across a lot of different axes at the same time. For example, when people say they want to be able to scale the number of reads, they’re also typically implying that they don’t want the read or write performance to degrade while they’re doing that. You can’t just make some element bigger, you have to retain a lot of other things. At the extremes of scalability, what I’ve always found with databases is that you end up making some compromises. For example, the original eventually consistent databases all came along at a time when people were trying to scale write volumes to the point where they couldn’t maintain strong ACID guarantees and scale writes to that degree. The hardest part of scalability is the TANSTAAFL rule, like “there ain’t no such thing as a free lunch” or “better, faster, cheaper, pick two.” Neo4j is a system that is basically trying to maintain strong consistency throughout and puts a premium on making sure that the read performance and throughput are really quite good. In taking those design decisions, certain kinds of scalability might be a little bit more difficult. Other systems might take a drastically different approach and have different scalability attributes at the cost of different trade-offs.

Kostas Pardalis 29:13

I love your definition of scalability in relation to trade-offs because, in the end, that’s exactly what scalability is: finding the right trade-offs based on the problem you’re trying to solve. That’s perfect. That’s an amazing definition.

David Allen 29:27

To take that one step further to graph partitioning, you said graph partitioning is a hard problem. I completely agree. There are a lot of other systems that will automatically partition your graph. They say, “Hey, you’ve just got a bunch of nodes and relationships; we’ll handle the sharding for you.” They make an explicit trade-off that users aren’t usually aware of. If you threw everything into a MySQL table and then you did horizontal or vertical partitioning of those tables to distribute your “graph,” you can see how, in doing so, you would be creating a lot of breakages where, in order to traverse a relationship, you would need to cross between shards. Doing the network coordination between shards is an expensive thing to do in all distributed databases. So when we think about how to partition a graph, there’s one way to partition it that allows you to write an unlimited amount of data at the cost that your read queries might get increasingly expensive and difficult to do. And there might be another way of partitioning your graph that really takes into account what the schema and connected components of the graph are, which can preserve really high throughput and performance at the expense of making the partitioning scheme less easy, more difficult to create, and maybe not automatic. That’s manual sharding versus automatic sharding. Automatic sharding is clearly possible, but it creates some trade-offs that you might not realize you’ve made until you’re already at scale.
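
The trade-off between the two sharding styles can be made visible by counting how many relationships end up crossing shards under each scheme. This Python sketch uses a made-up graph of two tightly connected clusters joined by a single edge; the node names, the round-robin assignment standing in for "automatic" sharding, and the `cross_shard_edges` helper are all assumptions for illustration, not a description of Neo4j Fabric.

```python
# Hypothetical graph: two triangles ("a" cluster and "b" cluster) joined by one edge.
EDGES = [
    ("a1", "a2"), ("a2", "a3"), ("a1", "a3"),
    ("b1", "b2"), ("b2", "b3"), ("b1", "b3"),
    ("a3", "b1"),
]


def cross_shard_edges(edges, shard_of):
    """Count relationships whose two endpoints live on different shards."""
    return sum(1 for u, v in edges if shard_of[u] != shard_of[v])


nodes = sorted({n for edge in EDGES for n in edge})

# Structure-blind assignment: alternate nodes between two shards.
round_robin = {node: i % 2 for i, node in enumerate(nodes)}

# Structure-aware assignment: keep each cluster on its own shard.
by_cluster = {node: 0 if node.startswith("a") else 1 for node in nodes}

print(cross_shard_edges(EDGES, round_robin))
print(cross_shard_edges(EDGES, by_cluster))
```

Every crossing counted here stands for a network hop during traversal, so the scheme that follows the shape of the data keeps reads cheap, at the price of someone having to design it.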

Kostas Pardalis 31:07

In the case of Neo4j, which one do you recommend to your customers?

David Allen 31:15

In the case of Fabric, we’re usually recommending that the customers come up with a partitioning scheme that makes sense for their data. I was just reading this crazy Twitter thread about this last week. It has to do with how to create cut points in your data model such that when you distribute your graph across multiple partitions, you are minimizing the number of times you’re going to have to cross partitions, and that is the property that’s going to mean that your queries are still going to perform well at scale.

Kostas Pardalis 31:48

One last question from me and then I’ll let Eric ask his questions because I’m monopolizing the conversation here. Can you give us a little bit more information from an architectural standpoint about how Neo4j fits in the modern data stack? How does it integrate or interoperate with other common data systems in an organization, from what you’ve seen from your customers?

David Allen 32:17

Let’s first picture Neo4j as a complete black box in order to attack the interoperability point. We need to see that there has to be some way to get data into the box in some way to get data out of the box. On the inside, we have a set of supported connectors that are available at the same price that comes with the commercial software. We do Kafka, for example, and Spark. We have a connector for business intelligence that basically treats the database as a JDBC endpoint. Between those options, driver applications, and also the ability to do things like load CSV, a lot of things that I do with customers involve creating ingest and egress pipelines. On the ingest route, you can do it either batch or streaming. A common architecture might be something like, my upstream Oracle system is publishing all changes on to a Kafka topic. Neo4j is reading from that Kafka topic and is transforming those records into a graph pattern, let’s say three or four nodes linked together with some relationships. Neo4j is following along with whatever is coming from Oracle. We are augmenting that with some reference metadata that comes from other systems. Here we have a knowledge graph in the center. On the egress side, it’s again, kind of a situation of whether you want to do it batch or streaming. You can do it streaming via Kafka or Ms. Kay, things like that. If it’s batch, you can write a program, you can use things like Cloud Functions like Amazon, Lambdas, and so forth or you can use stuff like data, bricks, notebooks, I’ve been using a lot of Spark and data bricks myself lately for working with Neo4j, to get the data downstream. Let’s zoom out from that perspective for a second: you have this black box Neo4j and you have a good set of options to get data into graphs. You have good options to get data out of graphs back to tables, or documents, or whatever it is that you need. You can look at that entire architecture as like a graph coprocessor on top of any other system. 
For example, some people will do things like graph-assisted search. They might have Elasticsearch that they’re using for website search, but they might also load their product taxonomy into Neo4j and then change their website search so that it’s still primarily Elasticsearch text, but it’s also being informed by search expansion from the knowledge graph. You might search for black tube socks, and via the knowledge graph, I might expand that out to also include search results for leggings, for example, because we know that those are related from the knowledge graph, even if tube socks as a piece of text would never match leggings in text form. So you can basically take this graph architecture and add it on to other systems. We’re not necessarily trying to replace them or say that Neo4j has to do everything for you, but it can add a lot of graph value to whatever it is that you’re already doing. That same kind of pattern tends to repeat itself when it comes to financial fraud, fraud detection engines, recommendation engines, and so on. Some of the more interesting stuff that I get to play with these days has to do with putting graphs into machine learning pipelines in exactly that pattern I’m describing.
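
The search-expansion idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the taxonomy is a made-up dictionary standing in for a knowledge graph, the category names are hypothetical, and a real setup would pass the expanded terms on to the text search engine.

```python
# Hypothetical product taxonomy: product term -> category it belongs to.
TAXONOMY = {
    "tube socks": "legwear",
    "ankle socks": "legwear",
    "leggings": "legwear",
    "beanie": "headwear",
}


def expand_query(query, taxonomy):
    """Return the original query plus sibling terms that share a category."""
    text = query.lower()
    matched_categories = {
        category for term, category in taxonomy.items() if term in text
    }
    siblings = {
        term
        for term, category in taxonomy.items()
        if category in matched_categories and term not in text
    }
    return [query] + sorted(siblings)


print(expand_query("black tube socks", TAXONOMY))
```

The text "tube socks" never matches "leggings" as a string; the link only exists because both terms hang off the same category in the taxonomy.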

Kostas Pardalis 35:45

That’s super interesting. Do you also see use cases where Neo4j is used together with an OLAP system, like for BI or analytics workloads?

David Allen 35:59

Definitely, yes. For BI or analytics, can you give me a little bit more on the question? What areas do you have in mind?

Kostas Pardalis 36:08

I’m talking more about the typical cases inside the company, like reporting, where we try to figure out, for example, how marketing is performing. One of the things in marketing (and I think Eric will have many more questions around that) is attribution, which, in my mind at least, you can consider as a kind of graph. Or you have the customer journey: I have the customer, this is the first touchpoint, this is the second touchpoint. All these problems are usually put under the umbrella of traditional BI because it’s something that tries to explain what happened. That’s what I mean, whether you see use cases there.

David Allen 36:48

Yes, definitely. The intent of what you’re asking, I break into two categories. Sometimes Neo4j is the system of record because it’s a transactional database. Basically, people are trying to do BI on what’s happening in their Neo4j system. In that case, we have this connector for BI where you treat it like a JDBC endpoint and then you can use Tableau against Neo4j if you want to do that traditional BI stuff. A different use case is when Neo4j isn’t the system of record but it might store, for example, a knowledge graph or a set of taxonomies that you’re going to use as a reference data source in the BI or in the analytics process. That is also a pattern, but a different one.

Eric Dodds 37:31

That’s super interesting. One follow-up question on that. Being from a marketing background, that got my wheels turning on the marketing attribution side and thinking about the customer journey. The knowledge graph is one component of that. But do you see any use cases around behaviors? When we think about behaviors that are related, in the context of BI or just a traditional database, you’re talking about tons and tons of SQL, which you talked about before. You have different behaviors in seven different tables. Then you’re trying to tie that together using unique identifiers, which change by table. And so you end up with these monster queries to pull together a basic user journey. I’d love to know about any use cases where graph helps make that more elegant or easier to do.

David Allen 38:23

Yeah, absolutely. Here I’d point people to a use case on our website from Nordstrom. It has to do with personalized product recommendations and clickstream data. Earlier, we talked about how people use graphs for recommendations, like what might a person want to buy. At Nordstrom, they’re taking all these different events that are occurring on their website (and that’s partially the user journey through a website) and then using that to inform what the recommendation is. Every time somebody makes a touchpoint with your company, whether it’s downloading a white paper or sending an email or downloading a piece of software or something along those lines, that is part of their journey. You can use graphs to pull that together and look at a whole bunch of users in a cohort. A very useful analysis might be to ask: of the people who ghosted us and never called us back, what similarities did their journey have with the people who later bought the product? What were the most common early steps that they took on our website? You build graphs of these kinds of user journeys, and this is where some of our graph algorithms stuff comes in, where you can use it to establish similarities between nodes or you can do certain kinds of graph partitioning that can help you see some of these patterns.

Eric Dodds 39:58

Very cool. I know we’re getting close to time here. But one thing that I think would be helpful: we always like to talk practical implementation with our guests, because I think it’s helpful for our audience in terms of building a stack. It’s super interesting to hear that there are different ways on the ingress and egress side to connect into Neo4j. What do you typically see in terms of the stage of a company and the other tools that they might be using? I’m sure that there are people in our audience thinking graph might be a really interesting way to make some of the things they’re trying to do in a traditional warehouse a lot easier. What’s the point at which it makes sense? What does the process of implementing that into your stack look like?

David Allen 40:51

The companies we work with are all over the gamut in terms of their technical capabilities. In the last couple of weeks, I’ve worked with some that said, “We’re entirely on-prem, and data can only get in and out using this particular enterprise ETL suite,” so it’s got to be Informatica, for example. And that ranges all the way up to companies that have adopted a whole lot of cloud-native services. They’ll say things like, “All of our workloads are running on top of Google Kubernetes Engine, so we need help doing the network bits to use such and such a managed service together with GKE to ingest data.” So they’re really all over the map. Usually, when I go talk to a customer, first I try to establish what their baseline is. I love all those native cloud services, and I think that they have a lot of value to offer. But it’s also a massive learning curve, and it’s something that I find within some of these companies is best adopted piecemeal, a bit at a time. Because if I go in with an architecture that calls for the use of Kafka as a managed service and storage triggers based on S3 or something like that, I could really lose people. A good architecture is one that has to be operable by the people, organizations, and policies that you have around you. How to get started? Maybe don’t get started on a big architecture before you’ve proved the value to yourself. Usually, I tell people to take a look at Neo4j.com/sandbox and that will give you a playground where you can play with technology, load your own data, see if you can get some value out of it, and that helps people get started thinking about, “What does my data look like as a graph and can Cypher help me?” That sort of stuff. I find that the architecture bits then come as maturation of that basic value proposition.
You say, “Alright, I want that value, but I want it at a larger scale or with more timeliness.” Then all of those questions about ingress and egress and architecture get tackled in the context of “how do I extend some piece of value that I already know I want?”

Eric Dodds 43:11

Would you say there’s sort of a lightbulb moment people have when they start playing with the sandbox? I know there are a lot of people out there who have lived and worked in the traditional data warehouse paradigm (maybe for their entire career) and are familiar with graph but haven’t really dug into it in terms of practical value in their data stack. Can you describe that lightbulb moment a little bit, when you say, “Oh, wow, this could be a really valuable component of the stack”?

David Allen 43:45

I’ll give you an example of a lightbulb moment, but I think a lot of people might have to try it out with their data to see what it’s going to be in their business context. Imagine that you had a list of roads and you knew how long each road was and how much distance there was between each city and you had a “shortest path” problem. You say, “Well, I live in Richmond, Virginia and I want to get to Washington, DC along the shortest possible path.” Think about the kind of thing that Google Maps does every single day. Each road that you take has a distance and a speed limit, so it has a cost to traverse. You want to figure out the shortest path from Richmond to Washington. Figure out how you would do that with a relational database, and then get back to me. Then go take a look at what it would take to do that in Cypher, and you’ll have that lightbulb moment. Now, some people won’t connect with that experience, because they’ll say, “I don’t have Google Maps data.” That’s where you’ve got to try it out with the sandbox: load your data into a basic data model. You’re going to find that usually the “aha moment” is some sort of a variable-length path query, because that is the sort of thing that graphs make super, super easy and relational databases and other databases just don’t. There are certain use cases that are more relational where you’re not going to have that “aha moment.” I gave the example earlier: if you have a million customers and you want all the ones where the zip code is 23229, graphs can do that, but that’s not particularly a graph problem.
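
For readers who want to see the shape of the shortest-path problem described above, here is a minimal sketch using Dijkstra's algorithm in plain Python. It is not Cypher and not how Neo4j implements path finding; the road list is hypothetical and the mileages are illustrative numbers, not real distances.

```python
import heapq


def shortest_path(roads, start, goal):
    """Dijkstra's algorithm over an undirected road list of (city_a, city_b, miles)."""
    graph = {}
    for a, b, miles in roads:
        graph.setdefault(a, []).append((b, miles))
        graph.setdefault(b, []).append((a, miles))

    queue = [(0, start, [start])]  # (cost so far, current city, path taken)
    visited = set()
    while queue:
        cost, city, path = heapq.heappop(queue)
        if city == goal:
            return cost, path
        if city in visited:
            continue
        visited.add(city)
        for neighbor, miles in graph.get(city, []):
            if neighbor not in visited:
                heapq.heappush(queue, (cost + miles, neighbor, path + [neighbor]))
    return None  # no route between start and goal


# Illustrative mileages only.
ROADS = [
    ("Richmond", "Fredericksburg", 58),
    ("Fredericksburg", "Washington", 53),
    ("Richmond", "Charlottesville", 71),
    ("Charlottesville", "Washington", 117),
]

print(shortest_path(ROADS, "Richmond", "Washington"))
```

The number of hops is not known in advance, which is what makes this a variable-length path problem: in SQL each extra hop tends to mean another self-join or a recursive query, while a graph traversal simply keeps following relationships.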

Eric Dodds 45:21

That makes total sense.

Kostas Pardalis 45:22

I have one last question for you, David, and it’s about open source. What’s the relationship of Neo4j with open source, and how important has it been for the whole community and for Neo4j so far? Neo4j was born as an open-source product and the Community Edition is out there right now under a variant of the GPL, so there’s an open-source version available right now.

David Allen 45:50

I’m a person who has written multiple different open source packages that relate to Neo4j. One, called Neo4j Helm, helps you deploy it on Kubernetes. Another, called Halin, is a monitoring tool. I don’t know what else to say other than we’re strong proponents of open source, and that’s been a core part of our business since the beginning. Now, speaking not as a Neo4j employee, I think that open source is going through an interesting evolution right now in the industry, one that comes from the increase in cloud platforms and also people’s shift towards managed services. Consider that nobody ever asks whether Gmail is open source or not. When you use things like managed services, what open source means really starts to change. I don’t really have anything so great to say about that, other than that, as a practitioner and somebody in the industry, I’m pretty interested to see what’s going to happen in the coming years.

Eric Dodds 46:47

Absolutely. RudderStack, where Kostas and I both work, is open source as well, and it’s going to be a fascinating environment to operate in over the coming years. We’re at time, David. This has been so interesting, I learned a ton. I think our audience has learned a ton. I have all sorts of interesting ideas around graph with all the new knowledge you’ve given us. So we really appreciate you taking the time.

David Allen 47:13

Thanks for having me.

Eric Dodds 47:15

Always an interesting conversation on The Data Stack Show. We took a little aside there to talk about writing books, but I think it’s interesting when someone has done something maybe related but as an activity outside of their day-to-day work with data, so it was fun to hear about the process of writing a book. One of the interesting things that stuck out to me was how straightforward it seems to integrate something like Neo4j with an existing data stack. We talked about reading from a Kafka topic, which is a really, really common structure within data stacks these days, so that was exciting to me. I hope our listeners had some ideas about how they may be able to try this out with some of their existing data stack infrastructure.

Kostas Pardalis 48:03

Yeah, absolutely. From my side, I was quite surprised to hear all these different use cases. It was super interesting to see that graph databases can be used as part of the product experience but, at the same time, they can also be used for a lot of typical analytics workloads. What was really, really interesting for me is how it can be used inside the context of data governance. That’s something I definitely want to learn more about.

Eric Dodds 48:31

We’ll extend the show just another two minutes because I want to ask you about this. We talked about data governance a good bit in one of our recent shows. Do you think that graph is potentially one of the ways to solve data governance at a comprehensive level? If you can connect all of the data in your stack to it, it may make some of that governance workload easy.

Kostas Pardalis 48:59

It’s not going to solve all the problems that there are under data governance. I don’t think that’s the case. It’s what you need in order to manage access to data, things like he’s doing. The thing with data is that data gets continuously transformed inside the company. It is considered by many different stakeholders, many different systems. In the end, we need to track the evolution of data. A graph structure makes a lot of sense there. It’s a very native way of representing how a piece of data has evolved over time. With a lot of people, they’re active, they know these things. For stuff like data lineage, for example, that’s probably a very, very good way to represent this. Now, that’s one part of the problem, right? Then you need to feed the database with all the data and, most importantly, the metadata, which is another question we need to ask our guests. How do you get this data? How do you generate this metadata to feed a graph database to solve the problem?
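
As a small illustration of why lineage fits a graph, here is a sketch in plain Python that walks a lineage graph upstream from a report to every dataset it depends on. The dataset names and the lineage itself are hypothetical, and the dictionary stands in for whatever metadata store would actually hold this information.

```python
from collections import deque

# Hypothetical lineage: each dataset maps to the datasets it was derived from.
LINEAGE = {
    "revenue_dashboard": ["orders_clean", "customers_clean"],
    "orders_clean": ["orders_raw"],
    "customers_clean": ["crm_export"],
}


def upstream(lineage, dataset):
    """Breadth-first walk returning every dataset the given one depends on."""
    seen = set()
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        for parent in lineage.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


print(sorted(upstream(LINEAGE, "revenue_dashboard")))
```

The traversal is the easy part; as noted above, the harder question is how the lineage metadata gets generated and fed into the graph in the first place.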

Eric Dodds 50:12

Very interesting. Time will certainly tell. Thank you again for joining us on The Data Stack Show. Be sure to hit subscribe in your favorite podcast app, that way you’ll get notified of new episodes when they go live, and we will catch you on the next one.
The Data Stack Show is brought to you by RudderStack, the complete customer data pipeline solution. Learn more at RudderStack.com.
