Episode 42:

Graph Processing on Snowflake for Customer Behavioral Analytics

June 16, 2021

In this week’s episode of The Data Stack Show, Eric and Kostas talk with the co-founders of Affinio, Tim Burke and Stephen Hankinson. Affinio’s core intellectual property is a custom-built graph analytics engine that can now be ported directly into Snowflake in a privacy-first format.

Notes:

Highlights from this week’s episode include:

Launching Affinio and the engineering backgrounds of the co-founders (2:36)
The massive transformation in customer data privacy regulation in the past eight years (6:23)
Creating the underpinning technology that can apply to any customer behavioral data set (10:05)
Ranking and scoring surfing patterns and sorting nodes and edges (14:13)
Placing the importance of attributes into a simple UI experience (19:28)
Going from a columnar database to a graph processing system (25:20)
Working with custom or atypical data (32:46)
The decision to work with Snowflake (37:43)
Next steps for utilizing third-party tools within Snowflake (52:18)

The Data Stack Show is a weekly podcast powered by RudderStack. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 00:06

Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.

Eric Dodds 00:27

Welcome back to The Data Stack Show. Really interesting guests today. We have Tim and Stephen from a company called Affinio. And here’s a little teaser for the conversation. They run in Snowflake, they have a direct connection with Snowflake, but they do really interesting marketing and consumer data analytics, both for social and for first-party data, using Graph, which is just a really interesting concept in general. And I think one of my big questions, Kostas, is about the third-party ecosystem that is being built around Snowflake. And I think that’s something that is going to be really, really big in the next couple of years. There are already some major players there. And we see some enterprises doing some interesting things there. But in terms of mass adoption, I think a lot of people are still trying to just get their warehouse implementation into a good place and unify their data. So I want to ask about that from someone who is playing in that third-party Snowflake ecosystem. How about you? What are you interested in?

Kostas Pardalis 01:32

Yeah, Eric, I think this conversation is going to have a lot of Snowflake in it. One thing is what you’re talking about, which has to do more with the ecosystem around the data platforms like Snowflake. But the other and the more technical side of things is how you can implement these sophisticated algorithms around graph analytics on top of a columnar database like Snowflake. So yeah, I think both from a technical and business perspective, we are going to have a lot of questions around how Affinio is built on top of Snowflake. And I think this is going to be super interesting.

Eric Dodds 02:07

Cool. Well, let’s dive in. Tim and Stephen, welcome to The Data Stack Show. We’re really excited to chat about data, warehouses, and personally, I’m excited to chat about some marketing stuff, because I know you play in that space. So thanks for joining us.

Tim Burke 02:21

We’re excited to be here. Thanks for having us.

Eric Dodds 02:23

We’d love to get a little bit of background on each of you. And just a high level overview of what Affinio, your company, does for our audience. Do you mind just kicking us off with a little intro?

Tim Burke 02:36

Yeah, absolutely. I’d be happy to. So it’s a pleasure being on with you guys. And realistically, just to give you a quick sense of what Affinio was all about, and a little bit of background. So we created Affinio about eight years ago. It started off with a really simple concept where eight years ago, Stephen and I happened to be running a mobile app B2C company. And instead of looking at social media to see what people were talking about our brand, we started off with a really simple experiment of looking at who else our followers on social media were following. And that afternoon, we aggregated that data and saw a compelling opportunity against this interest and affinity graph that nobody seemed to be using or utilizing for basically advertising and marketing applications. And we thought it was just a huge opportunity. So we doubled down and created what continues to be our core intellectual property, which is a custom built graph analytics engine under the hood. And what we’ve done is, over those eight years, basically leveraged analyzing essentially social data as a starting point. But more and more, we had many of our enterprise customers really excited about what they could unlock from both insights and actionability against the data that we were providing them with, as well as basically using our technology. So over the last two years, we made a conscious effort to double down and start porting a lot of that core graph technology directly into Snowflake. And most recently, and we’re just about to announce, the release of four of our essentially apps inside the Snowflake marketplace, that enable organizations to essentially use our graph technology directly on their data, without us ever seeing the analytics and without us ever seeing the output. So it’s in a completely private format, all leveraging the secure function capability in Snowflake and the data sharing capability. So super excited to be here. 
And we’re obviously huge fans of both Snowflake and warehouse-first approaches, and we think the opportunity between Affinio and RudderStack is a great complement.

Eric Dodds 04:38

Very cool. And Tim, do you want to just give a quick 30 second or one-minute background on you personally?

Tim Burke 04:43

Yeah, certainly. So, I’m Tim Burke, CEO of Affinio. My background is actually in mechanical engineering. Stephen, who’s on the show, is my CTO and co-founder. We’ve been working together for 12 years now. We’re both engineers by trade: he’s electrical, I’m mechanical. I do a lot of the biz dev and sales work within Affinio, and obviously, from my position, a lot of customer-facing activities. And I’ll let Stephen introduce himself.

Stephen Hankinson 05:08

I’m Stephen Hankinson, CTO at Affinio. Like Tim said, I’m an electrical engineer. But I’ve been writing code since I was about 12 years old and I just really enjoy working with large data, big data, and solving hard problems.

Eric Dodds 05:21

Very cool. Well, so many things to talk about, especially Snowflake and combining data sets. And that’s just a fascinating topic in general. But one thing that I think would be really interesting for some context, so Affinio started out, providing graphs in the context of social. And one thing I’d love to know. So you started eight years ago, and the social landscape, especially around privacy and data availability, etc, has changed drastically. And so just out of pure curiosity, I’m interested to know what were the kinds of questions that your customers wanted to answer eight years ago, when you introduced this idea? And then how has the landscape impacted the social side of things? I know, you’re far beyond that, but you have a unique perspective in dealing with social media data over a period where there’s just been a lot of change, even from a regulatory standpoint?

Tim Burke 06:23

Absolutely. I would say you hit the nail on the head. It’s been a transformational period for data privacy, customer data privacy. And social data as a whole has probably been one of the most impacted areas. So, we’ve definitely seen a massive transition. I would say that a lot of that transition over the last few years is partially a change in our focus for that exact reason, right, recognizing that with deprecations in public APIs and the privacy aspects, data availability across social has changed drastically. And so, for us, we’ve been at the front of the line watching all this happen in real time. But the customers at the end of the day are still trying to solve the same problem: how do I understand and learn more about my customers such that I can service them better, provide a better customer experience, and find more of my high-value customers? Net-net, I don’t think the challenge has changed. I think the data assets that those customers are leveraging to find those answers are going to change and have been changing. Our legacy social products were, much of the time, addressing deeper understanding of interest profiles, and rich interest profiling of large social audiences is kind of where we got started. And for us, that’s obviously one of the most valuable insights for a marketer, because when you understand the scope and depth of your audience’s interest patterns, you can basically leverage that for where to reach them, how to reach them, how to personalize content, knowing what offers they’re going to want to click through to. And I don’t think that’s actually changed, right? 
I think that what people are recognizing more so than anything, and obviously you guys would see this firsthand as well, is that for many of those data assets that organizations were willing to have vendors either collect or own on their behalf, that has changed drastically. It’s now basically requiring these enterprises and organizations to own those data assets and be able to do more with them. And so what we’re seeing firsthand is that the market’s come around to recognizing the need to collect a lot of first-party data. Many organizations have obviously put a lot of effort, energy, and resources behind creating that opportunity within an enterprise. But quite honestly, what we see is that there’s a lack of ability to get meaningful insight and actionability from those large datasets that they’re creating. So our focus is on trying to enable the enterprise to unlock at-scale applications, no differently than what we’ve done previously on massive social data assets, but this time on their first-party data, natively inside Snowflake, in a privacy-first format.

Eric Dodds 09:27

Super interesting. And just one more follow up question to that. I’m at risk of nerding out here and stealing the microphone from Kostas for a long period which I’ve been known to do in the past. But in terms of graph, was the transition from third-party social data to accomplishing similar things on your first-party data on Snowflake? Was that a big technological transition? I’d just love to know from under the hood standpoint. How did that work? Because the data sets have similarities and differences.

Tim Burke 10:05

No, it’s a great point. I mean, for those not familiar with graph technology, traditional graph databases are founded on transforming relational data into nodes and edges, right, and essentially analyzing the connectivity in a data asset. So our underpinning technology, which Stephen created firsthand, is this custom-built graph technology. It analyzes data based on that premise: everything’s a node, everything’s an edge. And at the primitive level, it enables us to ingest and analyze any format of customer data without having to make drastic changes to the underpinning technology. And so what I would highlight is that the most compelling data assets that we can analyze, and the most compelling insights you can gather, typically are driven by customer behavioral patterns, right. Traditional demographic data has its utility, and obviously always has in a marketing and advertising application, but I would argue that demographics have traditionally been used as a proxy for a behavioral pattern, right? And the opportunity we see to unlock is that if you’re able to uncover patterns inside of raw customer behavioral data, what you as a marketer or an advertiser want to do is ultimately change or enhance that behavior, right. 
So instead of using demographics as a way to slice and dice data and create audiences, which ultimately are simply a surrogate for the underpinning behavior you’re looking to change, what we see as an opportunity is across these massive data sets that are being pulled into and aggregated in Snowflake. When you start to analyze those behaviors at the raw level and unlock patterns across a massive number of consumers at that level, you can then start actioning on that, and leveraging those insights for advertising, personalization, targeted campaigns, and next best offer in a format that is driven by you unlocking that behavioral pattern.

Tim Burke 12:15

So for us, when I speak of customer behavioral patterns, everything that relates to transactional data, content consumption patterns, search behavior, and clickstream data, all those become signals of intent and of interest, and ultimately are a rudimentary behavior. We can ingest and transform that data into a graph inside of Snowflake, and analyze those connections and similarity patterns across those behaviors natively in the data warehouse. And in doing so, we create audiences around common interest patterns and lookalikes, and build propensity models off those behaviors. And so the transformation, I mean, I wouldn’t understate it. Stephen obviously put a lot of time into that transformation. I think it was more so that we had initially architected the underpinning technology for the purpose of a certain data set. What we unlocked and identified was that there was a host of first-party data applications we could apply this tech to, and that was the initial aha moment for us in terms of moving it into a Snowflake instance, so that we can basically apply it to any customer behavioral data across that data set.
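
Editor’s note: the idea that "everything’s a node, everything’s an edge" can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Affinio’s engine: the event rows, the `kind:value` naming of behaviors, and the function `build_bipartite_graph` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical behavioral events as (customer_id, behavior) rows,
# the kind of data that might sit in a warehouse table.
events = [
    ("c1", "viewed:show_a"),
    ("c1", "bought:product_x"),
    ("c2", "viewed:show_a"),
    ("c2", "searched:running shoes"),
    ("c3", "bought:product_x"),
]


def build_bipartite_graph(rows):
    """Turn event rows into a customer-to-behavior bipartite graph.

    Customers and behaviors are the nodes; each distinct
    (customer, behavior) pair is an edge. Both directions are kept
    so the graph can be walked from either side.
    """
    customer_to_behaviors = defaultdict(set)
    behavior_to_customers = defaultdict(set)
    for customer_id, behavior in rows:
        customer_to_behaviors[customer_id].add(behavior)
        behavior_to_customers[behavior].add(customer_id)
    return customer_to_behaviors, behavior_to_customers


customers, behaviors = build_bipartite_graph(events)

# Customers connected to the same behavior node share that signal.
print(sorted(behaviors["viewed:show_a"]))
```

Because the graph is defined only by what counts as a node and what counts as an edge, the same structure could hold transactions, content views, or searches without changing the code.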

Kostas Pardalis 13:31

That’s super interesting. I have a question. Stephen might have a lot to say about that. But you’re talking a lot about graph analysis that you’re doing. Can you explain to us and to our audience a little bit, how graphs can be utilized to do analysis around the behavior of a person or, in general, the data that you’re usually working with? Because from what I understand, like the story behind Affinio, when you started, right, you were doing analytics around social graphs, right, where the graph is like a very natural kind of data structure to use there. But how can this be extended to other products and to other use cases?

Stephen Hankinson 14:13

Yeah, I’d say one example of that would be in surfing patterns, like Tim had mentioned. Essentially, we can get a data set of sites that people have visited, and even keywords on those sites and other attributes related to the sites, like the times that they visit them. And we can put that all together into a graph of people traversing the web. Then we’re able to use some of our scoring algorithms on top of that to rank and score those surfing patterns, so that we can put people or users that look similar into a separate segment or audience, which we can then pop up and show analytics on top of. That way people can get an idea of what that group of people enjoy visiting online, where they go, or the types of keywords that they’re looking at online, based on the data set that we’re working with. I guess that would be one example of a graph that’s not social.
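
Editor’s note: Stephen does not say which scoring algorithms Affinio uses. As a stand-in, the sketch below ranks users by Jaccard similarity of the sites they visited, one common way to find users who "look similar". The site names, user IDs, and function names are made up for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: size of overlap / size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical surfing data: user -> set of sites visited.
visits = {
    "u1": {"news.example", "surf.example", "gear.example"},
    "u2": {"surf.example", "gear.example"},
    "u3": {"cooking.example"},
}


def most_similar(user, visits_by_user):
    """Rank every other user by similarity to `user`, best first."""
    scores = [
        (other, jaccard(visits_by_user[user], sites))
        for other, sites in visits_by_user.items()
        if other != user
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranked = most_similar("u1", visits)
print(ranked)
```

Users whose scores clear some threshold could then be grouped into a segment. Comparing every pair like this does not scale to hundreds of millions of IDs, which is part of why a purpose-built engine is needed.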

Tim Burke 15:11

And I’d just pick up on that, Kostas, as well. I think the thing that we see, as Stephen alluded to, is at the lowest level of the signals that are being collected. Just to liken it to a social graph: there, you have a follower pattern, which essentially defines and creates the social graph. What we’re doing is taking those common behaviors as the nodes and edges. So as Stephen alluded to, whether it be sites that people visit, similar content that they’re consuming, or transactional histories that look similar to one another, the application effectively is just how we transform, to your point, those individual events into essentially a large customer graph on first-party data within the warehouse. And then from there, the analytics and applications are very, very similar, regardless of whether you’re analyzing a social graph, a transactional graph, or a web surfing graph. It ultimately comes down to what your definitions are for those nodes and edges at the core.

Kostas Pardalis 16:16

Yeah. And what’s the added value of trying to describe and represent this problem as a graph instead of like, I don’t know, like more traditional analytical techniques that people are using so far?

Tim Burke 16:30

For us, it comes down to segmentation, specifically. At the core of what advertisers and marketers do on a daily basis is cutting, slicing, and dicing data, and oftentimes that is restricted to a single event, right? So find me the customers that bought product X, find me the customers that viewed TV show Y. Oftentimes the analytics capabilities are restricted to the scope of that small segment. What we’re doing is we’re able to take that segment and look across all their behaviors beyond that initial defined audience segment. And by compiling all those attributes simultaneously, inside of Snowflake, we’re actually able to uncover the other affinities beyond that. So besides watching TV show X, what are the other shows that that audience is over-indexing on or has high affinity for? Besides buying product Y, what other products are they buying? And those signals, from a marketer’s perspective, start to unlock everything from recommendation engines and next best offer to net-new personalized customer experience recommendations, in terms of recognizing that this group as a whole has these patterns.

Tim Burke 17:47

And that’s at the core. When you think of it, you can certainly achieve that in a traditional relational database if you have two, three, ten attributes per ID. But at the scale we’re analyzing with our technology inside of Snowflake, you’re talking about potentially hundreds of millions of IDs against tens of thousands to hundreds of thousands of attributes. So when you actually try to surface what makes this segment tick and what makes them unique, trying to resolve that and identify the top 10 attributes of high affinity to that audience segment is extremely complex in a relational database or relational format. But using our technology and using graph technology, the benefit is that that can be calculated in a matter of seconds inside the warehouse, so that people like marketers and advertisers can unlock those over-indexing, high-affinity signals beyond the audience definition that they first applied. And that helps with everything, like I said, from understanding the customer all the way through to things like next best offer as well as media platforms of high interest.
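
Editor’s note: "over-indexing" is usually measured as the share of a segment that has an attribute divided by the share of the whole population that has it. The sketch below computes that ratio exactly on a toy data set. It is an assumption about the general technique, not Affinio’s scoring; the function `over_index`, its parameters, and the sample data are hypothetical.

```python
from collections import Counter


def over_index(segment_ids, attributes_by_id, exclude=(), top_k=10):
    """Rank attributes by affinity index for a segment.

    index = (share of the segment with the attribute)
            / (share of the whole population with the attribute)
    An index above 1.0 means the segment over-indexes on it.
    """
    segment = [i for i in segment_ids if i in attributes_by_id]
    if not segment:
        return []
    population = len(attributes_by_id)
    baseline = Counter(a for attrs in attributes_by_id.values() for a in attrs)
    in_segment = Counter(a for i in segment for a in attributes_by_id[i])
    scores = []
    for attr, count in in_segment.items():
        if attr in exclude:
            continue  # skip the attribute that defined the segment
        segment_rate = count / len(segment)
        baseline_rate = baseline[attr] / population
        scores.append((attr, segment_rate / baseline_rate))
    # Highest index first; ties broken by name for stable output.
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))[:top_k]


# Hypothetical viewing data: ID -> set of shows watched.
attributes_by_id = {
    "c1": {"show_x", "show_y"},
    "c2": {"show_x", "show_y"},
    "c3": {"show_x", "show_z"},
    "c4": {"show_z"},
}

# Segment: everyone who watched show_x. What else do they over-index on?
segment = [i for i, attrs in attributes_by_id.items() if "show_x" in attrs]
result = over_index(segment, attributes_by_id, exclude={"show_x"})
print(result)
```

Here `show_y` comes out above 1.0 and `show_z` below it. This brute-force version touches every ID and attribute, which is the cost Tim describes as impractical at hundreds of millions of IDs.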

Kostas Pardalis 18:53

Right. That’s super, super exciting for me. I have a question. It’s more of a product-related question, not a technical one: how do you expose this kind of structure to your end user, who from what I understand is a marketeer, right? And I would assume that most marketeers don’t really think in terms of graphs, or it’s probably something a little bit more abstract in their heads. Can you explain to me how you manage to expose all the expressivity that the graph can offer for this particular problem to a non-technical person like a marketeer?

Tim Burke 19:28

Yeah, it’s a great question. For us, a lot of what we created eight years ago, and even the momentum on our social application, was the simplicity of identifying those over-indexing signals, and the ability to do unsupervised clustering on those underpinning behaviors to unlock what I would deem data-driven personas. And so we put a lot of energy into trying to restrict how much data you surface to your end user and trying to simplify it based on their objective. A key element of that is recognizing that, within the framework of these applications that we built inside Snowflake, our end user actually does not get exposed to the underpinning graph-based transformation and all the magic that’s happening inside of Snowflake. What they do get exposed to, and what our algorithm is able to do, is essentially surface in rank order the importance of those attributes, and place those into a simple UI experience. And the benefit at the end of the day is that because all these analytics are running natively inside Snowflake, any application that has a direct connector to Snowflake can essentially query and pull back these aggregate insights. So think of a standard BI application that has a standard connector into Snowflake: with very little effort, it can leverage the intelligence that we’ve built inside of Snowflake and pull forward, based on an audience segment definition, the over-indexing affinities in rank order for that particular population. So, I think you nailed it: for many in the marketing field, graph technology is not one of their primary backgrounds, and if you ask them how they would use a standard graph database, that’s not something that most people are thinking about. 
What they are thinking about, though, and thinking hard about, is these simple definitions: what are the things that make an audience segment unique, make them tick, make them behave the way they behave? And unless you approach that problem statement with a graph-based technology under the hood, it’s extremely complicated, extremely challenging. Many organizations we work with talk about the fact that what we’re unlocking inside the warehouse in a matter of seconds would traditionally have taken a data science team or an analyst team days, if not weeks, to unlock. And so, for us, it becomes scalability. It’s the repeatability of these types of questions that guys like Eric, I’m sure, live and breathe every day: what makes a unit of an audience tick, right? Whether that is, of the people who churn, what are the over-indexing signals so that we can plug those holes in the product, or, of the high-value customers, what makes their behavior on our platform unique, those are the things that we’re trying to unlock and uncover for a non-technical end user. Because that is their daily activity; they have to crack that nut on a daily basis in order to achieve their KPIs. And so, that’s what we’re most excited about. When Stephen and I started eight years ago, graph technology, certainly as it pertained to applications in marketing, was really still very, very new. I would say it’s still very nascent. But I think it’s coming of age, because as we grow the data assets inside of things like Snowflake’s data warehouse, unless you can analyze across the entire breadth of that data asset and unlock, in an automated way, these key signals that make up an audience, the challenge will always be the same. And the challenge is going to get worse, right? Because we’re not making datasets smaller, we’re making them larger. 
And so the complexity and challenge associated with that just increases with time, and that’s what we’re trying to trivialize. There are repeatable requests to a marketing analyst, to a marketing team, to an advertiser and a media buyer, and predominantly they’re affinity-based questions, whether or not people recognize or ask them as such. A lot of the time, that’s exactly what it is. For the person who just signed up on our landing page, what should we offer them? What are their signals? What kind of signals influence what we recommend to them, how we manage the customer experience, how we personalize content? Those types of questions we see being addressed on a daily basis by marketing teams, many of whom don’t have direct access, obviously, to the raw data. And that’s why a lot of our technology natively inside of Snowflake is unlocking the ability for them to do that in aggregate without ever being exposed to private or row-level data.

Kostas Pardalis 24:21

That’s amazing. I think that’s one of the reasons that I really love working with these kinds of problems and engineering in general: this connection of something as abstract as a graph to a real-life problem, like something that a marketeer is doing every day. I think that’s a big part of the beauty behind doing computer engineering, and I really enjoy that. But I have a more technical question now. I mean, we talked about how we can use these tools to deliver value to the marketing community. So how did you go from a columnar database system like Snowflake to a graph processing system? How did you bridge these two different data structures? On one side, you have a tabular or columnar way of representing the data, and on the other, you have something like a graph. So how do these two things work together?

Stephen Hankinson 25:20

Yeah, so basically, what we end up doing is we have some secure functions in our account that we share over to the customer. What that does is give them a shared database, which includes a bunch of secure functions that we’ve developed. And then we work with the customer to give them predetermined functions or queries that they will run on top of their data, based on the structure of their tables. The queries that we give them essentially pass their raw data through our encoders, and that outputs new data into a new table. That really just looks like a bunch of garbage if you look at it in Snowflake. It’s mostly binary data, but it’s a probabilistic data structure that we store the data in. With that probabilistic data structure, they can then use our other secure functions, which are able to analyze that graph-based data and output all the insights that Tim was mentioning before. Essentially, you just feed in a defined audience that you want to analyze, and it will run all the processing in the secure function on top of the probabilistic data structure, and then output all the top attributes and scores for the audience that they’re analyzing.

Kostas Pardalis 26:38

Oh, that’s super interesting. Stephen, can you share a little bit more information about this probabilistic data structure?

Stephen Hankinson 26:45

Yeah, it’s essentially putting the data in a privacy-safe format. It’s basically feeding in all the IDs with the different attributes that they want to be able to query against, and using some hashing techniques to compute this new structure, which can then be bumped up against other encoded data sets of the same format. And then once you mash them together, you can use some algorithms that we have in our secure function library. From there, you get all kinds of things like intersections, overlaps, and unions of all kinds of sets. It’s basically doing a bunch of set theory on these different data structures in a privacy-secure way.
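
Editor’s note: the episode does not name the probabilistic data structure Affinio uses. To give a feel for the general idea of hashing IDs into a compact structure that still supports set operations, here is a minimal MinHash sketch, a well-known technique chosen purely as a stand-in. The signature size, the hash construction, and all function names are assumptions for illustration.

```python
import hashlib

NUM_HASHES = 128  # assumed signature length; more hashes, less error


def _hash(seed, value):
    """Deterministic 64-bit hash of `value`, salted with `seed`."""
    digest = hashlib.blake2b(f"{seed}:{value}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(ids):
    """Encode a non-empty set of IDs as a fixed-size list of minimum hashes."""
    return [min(_hash(seed, i) for i in ids) for seed in range(NUM_HASHES)]


def estimate_jaccard(sig_a, sig_b):
    """Estimate overlap / union as the fraction of matching positions."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def merge_union(sig_a, sig_b):
    """Signature of the union of two sets: element-wise minimum."""
    return [min(a, b) for a, b in zip(sig_a, sig_b)]


# Two hypothetical audiences that share 500 of 1,500 distinct IDs.
audience_a = {f"id{i}" for i in range(0, 1000)}
audience_b = {f"id{i}" for i in range(500, 1500)}

sig_a = minhash_signature(audience_a)
sig_b = minhash_signature(audience_b)

print(estimate_jaccard(sig_a, sig_b))  # an estimate of the true value, 1/3
```

Each audience is reduced to 128 numbers regardless of its size, and two signatures built the same way can be compared or merged without going back to the raw IDs. Whether a structure like this is private enough for a given use case is a separate question that this sketch does not address.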

Kostas Pardalis 27:29

Yeah, that’s super interesting. And I mean, there’s a big family of database systems, which is actually graph databases, right? So from your perspective, why is it better to implement something like what you described, compared to getting the data from a table on Snowflake and feeding it into a more traditional, let’s say, kind of graph system?

Stephen Hankinson 27:55

I think the main benefit of doing it this way is that they don’t need to make a copy of their data and they don’t need to move their data. It simply stays in one place.

Tim Burke 28:03

Yeah. And I would just add to that, Kostas, as well. When we speak of the benefits of Snowflake’s underpinning architecture and the concept of not moving data, what we’re not trying to do is replicate all the functionality of a graph database. There are obviously applications for which it is absolutely suitable and reasonable to make an entire copy of the data set and run the type of analytics inside the warehouse. But what we’re trying to do is take the applications for marketing and advertising and productize them in a format that does not require that, still leaves the data where it is inside of Snowflake, and provides this level of anonymization. And I would also highlight the fact that Stephen’s code that does the encoding of that new data structure also enables about a five-to-one data compression, which also supports more queries for the same price when it comes to this affinity-based querying structure.

Kostas Pardalis 29:03

Yeah, that’s very interesting. This discussion that we’re having, about the comparison between a more niche kind of system around graph processing and a general kind of graph database, reminds me a little bit of something that also happened here at RudderStack from an engineering point of view. We built a part of our infrastructure that needed some capabilities similar to what Kafka offers, right? But instead of incorporating Kafka in our system, we decided to go and build part of this functionality over Postgres, in a system that’s tailor-made for exactly our needs. And I think that finding this trade-off between a generic system and something that is tailor-made for your needs is what makes engineering as a discipline super, super important. I think in the end, this is the essence of making engineering choices when we’re building complex systems like this: trying to figure out when we should use something more generic as a solution, or when we should take a subset of it and make it tailor-made for the problem that we’re trying to solve. And that’s extremely interesting. I love that I hear this from you. We had another episode in the past with someone from Neo4j. And we were discussing this because, if you think about it, a graph database in the end is a specialized version of a database, right? In the end, a database system like Postgres can replicate the exact same functionality that a graph database system offers, right? But still, we are focusing more on a very narrowly defined problem, and we can narrow it even more. And that’s what you’ve done. And I find a lot of beauty behind this. So this is great to hear from you guys.

Tim Burke 30:48

I think it’s also interesting, just picking up on that in terms of the decision around like when do you optimize versus leave it generic. For us a big part of that, you can also see obviously, in the market, right, there’s machine learning and machine learning platforms that can have a host of different models that can be used for a host of different things, the Swiss Army Knife application within an organization. For us anyway, when those custom requests come in from teams, absolutely. Like those types of platforms make a lot of sense, because your data science team has to go in, it’s probably a custom model and the custom question that’s being answered, I think, for us specifically, when it comes time to actually building an optimized solution, something that can be deployed natively inside Snowflake, it comes down to repeatability and efficiency, right. So it’s like, when the same requests are being made hundreds of times a year, should you actually be building a custom model every time? Or should you actually push that workload into the warehouse? And for us anyway, that’s been a specific focus. For those applications of those requests that you can have the marketer self-serve and get the answers they need in seconds, as opposed to putting it on the data science team backlog. Those are the applications for us that we’re focused on and actually pushing in and optimizing.

Kostas Pardalis 32:09

Yeah, yeah, I totally agree. So, one last more technical question from my side. You mentioned that the way the system works right now is that you get the raw data that someone has stored in Snowflake, and you have some kind of encoders or transformers that transform this data into these probabilistic data structures that you have. Do you have any kind of limitations in terms of what data you can work with? Do you have some requirements in terms of the schema that this data should have, and what’s the pre-processing that the user has to do in order to utilize the power of your system?

Stephen Hankinson 32:46

So if it’s in essentially rectangular form data, it’s pretty easy to ingest into the encoder; we have a UI that will do that for you. But if there are some weird things about the data that wouldn’t be typical, we can actually work with them. If they give us an example of what the data looks like, we can essentially craft an encoding query for them that they just feed everything through. And that will still end up in the right form to go into our encoder and still end up in the central probabilistic data graph format that we use. So we haven’t currently run into any dataset that we haven’t been able to encode. It seems to be pretty generic at this point.
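
The encoding step Stephen describes can be illustrated with a minimal sketch. Affinio’s actual encoder and data format are not described in this conversation, so everything here is an assumption made for illustration: the use of MinHash as the probabilistic set structure, the `encode` and `estimate_jaccard` helpers, the `NUM_HASHES` size, and the column names. It shows how rectangular rows could be reduced to one fixed-size signature per attribute value, and how two signatures can then be compared without going back to the raw rows.

```python
import hashlib

NUM_HASHES = 64  # assumed signature size; more hashes lower the estimation error
EMPTY = 2**64    # sentinel larger than any 64-bit hash value


def _hash(seed, value):
    """Deterministic 64-bit hash of `value`, varied by `seed`."""
    digest = hashlib.blake2b(
        value.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def encode(rows, key_col, attr_col):
    """Reduce rectangular rows to one MinHash signature per attribute value.

    `rows` is an iterable of dicts (hypothetical input shape). Each signature
    keeps, for every seed, the smallest hash seen among the members.
    """
    signatures = {}
    for row in rows:
        sig = signatures.setdefault(row[attr_col], [EMPTY] * NUM_HASHES)
        member = str(row[key_col])
        for seed in range(NUM_HASHES):
            h = _hash(seed, member)
            if h < sig[seed]:
                sig[seed] = h
    return signatures


def estimate_jaccard(sig_a, sig_b):
    """Estimate |A ∩ B| / |A ∪ B| as the fraction of matching positions."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    rows = [{"customer_id": i, "interest": "surfing"} for i in range(100)]
    rows += [{"customer_id": i, "interest": "skating"} for i in range(50, 150)]
    sigs = encode(rows, "customer_id", "interest")
    # The exact Jaccard similarity is 50/150; the printed value is an estimate.
    print(estimate_jaccard(sigs["surfing"], sigs["skating"]))
```

In this sketch each signature has a fixed size regardless of how many customers share the attribute, and it stores hash minima rather than raw identifiers. That is a property of this illustration only, not a statement about how Affinio achieves its compression or anonymization.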

Kostas Pardalis 33:28

And is this process something that the marketeer is doing or there’s some need for support from the IT or the engineering end of the company?

Stephen Hankinson 33:36

We usually work with IT at that stage. And then once it’s encoded, the UI will work with the data that’s already been encoded. And they can also set up tasks inside of Snowflake, which will update a database over time, or that data set over time, to add new records or delete data as it comes in. But yeah, that is not enabled by the marketeer.

Kostas Pardalis 33:56

All right. And is Affinio right now offered only through Snowflake? Is there like a hard requirement that someone needs to have their data on Snowflake to use the platform?

Tim Burke 34:06

It is currently, Kostas. I mean, we obviously went through an exercise evaluating which platform to build on first. For us, it came down to two fundamental capabilities within Snowflake, or probably three. First, the secure functions that we’re utilizing to secure our IP in the applications that we share over. Second, the direct data sharing capability, which was fundamental to that decision. And then the third for us is obviously the cross-cloud application and the cross-cloud ability for us to ingest data across all three clouds. And so as a starting point, that was our buy-in, along with recognizing that Snowflake, with the momentum it has specifically in the media and entertainment, retail, and advertising space, is and continues to be a good fit for our applications at this stage. We’ve obviously had discussions more broadly about whether we can replicate this for specific cloud applications, but where we are right now, in terms of early market traction, our bet is on Snowflake and the momentum that they currently have.

Kostas Pardalis 35:12

And Tim, you mentioned, I think earlier that your product is offered through the marketplace that Snowflake has. Can you share a little bit more about the experience that you have with the marketplace? How important is it for your business? And why?

Tim Burke 35:28

Yeah, so I think the marketplace is still in its early stages, even with as many data partners as are already bought in. For us, I think one of the clear challenges that we face is that we are not data providers. So I think we’re slightly nuanced within the framework of what traditionally has been built up from a data marketplace or data asset perspective. We were positioned inside the marketplace deliberately and consciously with Snowflake, because our applications drive a lot of the data sharing functionality and add to the capabilities on top of that data marketplace, so that people can apply first-, second-, and third-party data assets inside of Snowflake and run our type of analytics on top of them. So for us, it’s been unique in the framework of simply being positioned almost as a service provider inside of what otherwise is currently positioned as a data marketplace. But I think over time you’ll start to see that bifurcate within Snowflake, and you’ll get a separation and a unique marketplace that will be driven by service providers like ourselves, alongside straight data providers.

Tim Burke 36:41

So I think it’s early stages. I think what we’re excited about is that we see a lot of our technology as being an accelerant to many of those data providers directly. And many of the ones that we’ve already started working with directly see it as a value proposition and a value add to the raw data asset that they may be sharing through Snowflake, and they also see it as a means with which to get more value from that data asset on the customer’s behalf by applying our application or technology in their Snowflake instance.

Kostas Pardalis 37:11

This is great. Tim, you mentioned a few things about your decision to go with Snowflake. Can you share a little bit more information around that and, more specifically, what would be needed for you to consider going to another cloud data warehouse, something like BigQuery or, I don’t know, Redshift? What is offered right now by Snowflake that gives tremendous value to you and makes you prefer, at this point, to build only on Snowflake?

Tim Burke 37:43

Yeah, I think if we stood back and actually looked at where Stephen and I started off in terms of our applications within first party, porting our graph technology into first-party data, much of that was very centered on applications and analytics specific to an enterprise’s own first-party data only. If it was only restricted to that model, I think we would have considered more broadly looking at doing that directly inside of any or all of the cloud infrastructures or cloud-based systems to begin with. But I would say that ours is a combination of the ability to do analytics directly on first-party data as well as, as Stephen indicated, a major component of our technology that we’ve created inside of Snowflake that unlocks this privacy-safe data collaboration across the ecosystem. And so as a result of that, for us the criteria in terms of selecting Snowflake was, again, the ability to leverage secure UDFs, secure functions, to lock and protect our IP that we’re sharing into those instances. But the second major component is the second half of our IP, which is effectively this privacy-safe data collaboration, which basically is powered by the underpinning data sharing capability of Snowflake. And so if and when we review or evaluate other applications or other providers in terms of where we port this next, I would say that that’s the lens that we look through, right? It’s like, can we unlock the entire capability across this privacy-safe data collaboration and analytics capability in a similar way that we’ve done it on Snowflake? Because to me, that’s the primary reason why we picked that platform.

Kostas Pardalis 39:34

Yep. And one last question from me, and then I’ll leave it to Eric. And it’s a question for both of you guys, just from a different perspective. You’ve been around quite a while. I mean, Affinio, as you said, started like eight years ago, which I think was pretty much the same time that Snowflake also started. So you’ve seen a lot around the cloud data warehouse and its evolution. How have things changed in these past eight years, both from a business perspective, and this is probably more of a question for you, Tim, and also from a technical perspective? How has the landscape changed?

Tim Burke 40:08

I think it’s absolutely interesting, the point that you’re making. I mean, I first learned of Snowflake directly from customers of ours, who were, at the time, coming to us with a specific request. It was very simple. They said, we love what you’re doing with our social data, we would love it natively in Snowflake. And that was honestly the first time we had learned of that application many, many years ago. But what I would say is that as far as the data warehouse has advanced from a technical perspective, I think for us anyway, it still belongs, or certainly has its stronghold, directly in the CDO, CIO, and CTO offices within many of these enterprises. What I expect to see, and what I think we’re helping to drive and pioneer with what we’ve built for marketing and advertising, is that the value of the assets being stored inside of the data warehouse has to become more broadly applicable and accessible across the organization, beyond where it traditionally has been locked away with high-infosec data science teams, because I think the value that needs to be tapped across global enterprises cannot funnel directly through just a single team all the time. And I think what we will see, and certainly what I think we are starting to see in these early stages, is awareness by other departments inside the enterprise of even where their data is stored, quite honestly. I mean, there are still conversations we’re having with many organizations in the marketing realm who have no idea where their data is stored, right? So I think familiarity and comfort level associated with that data asset, how to access it, what they can now access, and how they can utilize it will become the future of where the data warehouse is going to go. But I think we’re still a long way from there. There’s still a lot of education needed. But we’re excited about that opportunity specifically from the business perspective.

Stephen Hankinson 42:03

And on the tech side of things, I would say the biggest changes are probably around the privacy stuff that has changed over the years where you have to be a lot more privacy aware and secure. And basically working with Snowflake makes that a lot easier for us with the secure sharing of code, and secure shares of data as well. So using that with our code embedded directly into them, you can begin to be sure that customers using this, their data is secure. And even if they’re sharing data over to other customers, it’s secure to do that as well.

Kostas Pardalis 42:35

This is great, guys. So Eric, it’s all yours.

Eric Dodds 42:40

We’re closing in on time here. But I do have a question that I’ve been thinking about, really since the beginning. And it’s taking a step back. So Kostas asked some great questions about “why Snowflake?” and some of the details there. Stepping back a little bit, I would love your perspective on what I will call, for the purposes of this episode, the next phase of data warehouse utilization, and I’ll explain what I mean a little bit. So a lot of times on the show, we’ll talk about major phases that technology goes through. And in the world of technology, data warehouses are actually not that old. You have Redshift being the major player fairly early on. And then Snowflake hit general availability, I think, in 2014, but even then they were still certainly not as widespread as they are now. And the way that we describe it is, we’re currently living in the phase where everyone’s trying to put a warehouse in the center of their stack, and collect all of their data, and do the things that, you know, the marketing analytics tools have talked about for a long time, where it’s like, get a complete view of the customer. And everyone realized, okay, I need to have a data warehouse in order to actually do that. And that’s a lot of work. And so we’re in the phase where people are getting all of their data in the warehouse, it’s easier than ever, and we’re doing really cool things on top of it. But I would describe Affinio in many ways as almost being part of the next phase. And Snowflake is particularly interesting here, where, let’s say you collect all of your data. Now you can combine it with all sorts of other things natively, which is an entirely new world, right? There are all sorts of interesting data sets in the Snowflake marketplace, etc. But most of the conversation and most of the content out there actually is just about how you get the most value out of your warehouse by collecting all your data and doing interesting things on top of it.
And so I’d just love your perspective. Do you see the same major phases? Are we right in terms of being in the phase where people are still trying to collect their data and do interesting things with it? And then give us a peek, as a player who’s part of the marketplace and part of the third-party connections, at being able to operationalize natively inside your warehouse. What is that going to look like? I mean, marketing is an obvious use case. But I think in the next five years that’s going to be a major, major movement in the world of the warehouse. Sorry, that was long-winded, but that’s what’s been going through my mind.

Tim Burke 45:23

Totally. I mean, it is the stuff that we think about and talk about on a daily basis. I think you’re right. I think obviously the world has already woken up to the fact that gathering, collecting, owning, and managing all customer data in one location is going to be critical in the future, right? I would say COVID has woken the world up to that, in that, as many of us have heard and seen, there is no better driver for digital transformation than a pandemic. But at the same time, I completely agree with you. What I think personally, just given what we’re creating with these native applications inside of Snowflake, is that you will start to see an emergence of privacy-safe SaaS applications that are deployed natively inside the warehouse. I think you will see, literally, a transformation of how SaaS solutions are being deployed. And I think for organizations like Affinio, who have traditionally hosted data on behalf of customers and provided web-based logins to access that data that’s stored by the vendor, you’ll see and continue to see a movement where the IP and the core capabilities and the technologies of these vendors will begin to port natively into Snowflake. I believe that Snowflake itself will actually start to find ways to do attribution around the compute and value that vendors like ourselves, and the applications that they are driving inside of the warehouse, generate. And I think you’ll see that just naturally extend into rev share models, where, as an enterprise, you sign on to Snowflake and you have all these native app options that you can turn on automatically, which will basically allow you not only to reap more benefits, but to get up to speed and make your data more valuable faster. Right? And honestly, Stephen and I have talked about this for some time now. We honestly see that in the next 10 years, there’ll be a transition.
And certainly, it probably won’t eliminate the old model, but you’ll see a new set of vendors that will start building in a native application format right out of the gate, and that, I think, will transform the traditional SaaS landscape.

Eric Dodds 46:13

Yeah, absolutely. And one follow-on to that. So when you think about data in the warehouse, you can look at it from two angles, right? The warehouse is really incredible, because it can really support, well, not necessarily any kind of data, right, but data that conforms to any business model, right? So B2C, B2B, etc. It’s agnostic to that, right, which makes it fully customizable, and you can set it up to suit the needs of your business. So in some sense everyone’s data warehouse is heavily customized. When you look at it from the other angle, though, from this perspective of third-party data sets, and something that Kostas and I talk a lot about, which is common schemas or common data paradigms, right? If you look across the world of business, you have things like Salesforce, right? Salesforce can be customized, but you have known hierarchies: leads, contacts, accounts, etc. Do you think that the standardization of those things, or market penetration of known data hierarchies and known schemas, will help drive that? Or is everyone just customizing their data, and that won’t really play a role?

Tim Burke 49:01

You know, that’s a great question. I mean, it’s a conversation we’ve had with other vendors and with many of our customers relative to what they perceive as beneficial in many CDPs in market, to your point, Eric, right? Where the fixed taxonomies and schemas basically enable an app ecosystem and partner ecosystem to build easily on top of that schema. Yeah, completely. You know, I would say that I think it’s still early to see how that actually comes about. What I would say is that I think you will start seeing organizations adopt, within Snowflake and within their warehouse, many aspects of best-of-breed schemas, because as I see this applications space build out, it’s kind of the way that it has to scale, right? So, both from a partner and marketplace play as well as the plug-and-play nature of how you want to deploy this at scale, ultimately the game plan would be that, again, all these apps run natively, you could turn them on, they already know what the schema is behind the scenes, and they can start running. As Stephen alluded to, there’s obviously at this stage a lot of hand-holding at the front end until you get those schemas established and encoded into a format that’s queryable, etc. So I think what you’ll start to see is best of breed bridging across into Snowflake; that would be my assumption. The more that you see people leveraging Snowflake in a build-your-own format, it’s kind of required, right? And I wouldn’t be surprised to see some elements of that be adopted across into best of class and best of breed within Snowflake directly for that purpose.

Eric Dodds 50:47

Sure, yeah, it’s fascinating to think about a world where today, you kind of have your core set of tooling, right, and core set of data, and you build out your stack by just making sure that things can integrate in a way that makes sense for your particular stack, which in many cases requires a lot of research, etc. And it’s really interesting to think about the process of architecting a stack where you just start with a warehouse, and you make choices based on best-of-breed schemas. And, you know, at that point the tooling is heavily abstracted, right? Because you are basically choosing time to value in terms of best-of-breed schemas. Super interesting.

Tim Burke 51:37

Yeah, completely.

Eric Dodds 51:39

Alright, well, we’re close to time here. So I’m going to ask one more question. And this is really for our audience, and anyone who might be interested in the Snowflake ecosystem, what’s the best way to get started with exploring third party functionality in Snowflake? I mean, Affinio, obviously, a really cool tool, check it out. But for those who are saying, okay, we’re kind of at the point where we’re unifying data, and we want to think about augmenting it. You know, where do people go? What would you recommend as the best steps in terms of exploring the world of doing stuff inside of Snowflake natively, but with third party tools and third party datasets?

Tim Burke 52:18

I think it all starts with, from our perspective, many of the conversations we have with prospects and customers, which are around which questions are the repeatable ones you want addressed and answered. And in combination with that, obviously, a key element of what these types of applications enable from a privacy perspective is to unlock the ability to answer those types of questions by more individuals across the organization. So many of the starting points for us ultimately come down to: what are those repeatable questions and repeatable workloads that you’d like to have trivialized, and basically made plug and play inside of the warehouse, that will speed up what otherwise oftentimes is a three-week wait time or a three-week model or a three-week answer? And so I think for us, that’s where we start with most of our prospects and discussions. And I would think for those thinking about or contemplating that, that’s a great place to start: recognizing that this isn’t the silver bullet to address all questions or all problems. But for those that are rinse and repeat and repeatable, these types of applications are very, very powerful.

Eric Dodds 53:30

Love that. Just thinking back to my consulting days, doing lots of analytics or even tool choice for the stack, always starting with the question is, I think, just a generally good piece of advice when it comes to data.

Eric Dodds 53:48

Well, this has been a wonderful conversation, Tim and Stephen, really appreciate it. Congrats on your success with Affinio. Really cool tool, so everyone in the audience, check it out. And we’d love to have you back on the show in another six or eight months to see how things are going.

Tim Burke 54:03

Yeah, I would love to.

Stephen Hankinson 54:05

Thanks very much.

Eric Dodds 54:07

As always, a really interesting conversation. I think that one thing that stuck out to me and I may be stealing this takeaway from you Kostas. So I’m sorry. But I thought it was really interesting how they talked about the interaction of graph with your traditional rows and columns warehouse, in the paradigm of nodes and edges. That’s something that’s familiar to us relative to identity resolution in the stuff that we work on and that we’re familiar with. And so kind of breaking down that relationship in terms of nodes and edges, I think was a really helpful way to think about how they interact with Snowflake data.

Kostas Pardalis 54:46

Yeah, yeah, absolutely. I think this part of the conversation, where we talked about different types of representation of the data and how each representation can be better suited to specific types of questions, was great. And if there’s something that we can take out of this, it is that the underlying concept of the data remains the same in the end; however it is expressed, it’s the same thing, right? It doesn’t matter if you represent it as a graph, as a table, or in the end as a set. Because if you noticed, in the conversation that we had, they end up representing the graph using some probabilistic data structures that in the end represent sets, and they do some set operations there to perform their analytics. And that from a technical perspective is very interesting. And I think this is a big part of what computer engineering and computer science are actually about, right? How we can transform from one representation to the other, and what kind of expressivity these representations give us, keeping in mind that in the end all these are equivalent, right? The types of questions that we can answer are the same. It’s not like something new will come out of the different representations. It’s more about the ergonomics of how we can ask the questions, how naturally the questions fit these models and structures, and in many cases also about efficiency. And it’s super interesting that all these are actually built on top of a common infrastructure, which is the data warehouse, in this case Snowflake. And that’s a testament to how much of an open platform Snowflake is. Although, in my mind at least, the only other system that I have heard of being so flexible is Postgres. But Postgres, as a database, has existed forever, like 30 years or something. Snowflake is a much, much younger product.
But still, they have managed to have an amazing velocity when it comes to building the product and the technology behind it. And I’m sure that if they keep up the pace, we will have many things to say in the near future, both from a technical and a business perspective.
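
Kostas’s point that the table, graph, and set views are equivalent can be shown with a small sketch. The edge table and the two helper functions below are hypothetical, made up for illustration, and are not taken from Affinio or Snowflake. The same question, how many customers share two interests, is answered once with a join-style scan over rows and once with set intersection, and both give the same count.

```python
from collections import defaultdict

# Hypothetical edge table: one (customer, interest) row per edge, no duplicates.
EDGES = [
    ("c1", "surfing"), ("c1", "skating"),
    ("c2", "surfing"), ("c2", "skating"),
    ("c3", "surfing"), ("c3", "golf"),
    ("c4", "golf"),
]


def overlap_from_table(edges, a, b):
    """Table view: self-join the rows on customer and count matching pairs."""
    return sum(
        1
        for cust_a, int_a in edges
        for cust_b, int_b in edges
        if cust_a == cust_b and int_a == a and int_b == b
    )


def overlap_from_sets(edges, a, b):
    """Set view: group customers per interest, then intersect the two sets."""
    members = defaultdict(set)
    for customer, interest in edges:
        members[interest].add(customer)
    return len(members[a] & members[b])


if __name__ == "__main__":
    for pair in [("surfing", "skating"), ("surfing", "golf"), ("skating", "golf")]:
        print(pair, overlap_from_table(EDGES, *pair), overlap_from_sets(EDGES, *pair))
```

The answers match; what differs is how naturally the question is phrased and how much work each form takes, which is the ergonomics and efficiency trade-off described above.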

Eric Dodds 56:55

Great. Well, thank you so much for joining us on the show. And we have more interesting data conversations coming for you every week, and we’ll catch you on the next one.

Eric Dodds 57:08

We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at Eric@datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.

