Episode 125:

Authorization Is A Data Problem with Jeff Chao of Abbey Labs

February 8, 2023

This week on The Data Stack Show, Eric and Kostas chat with Jeff Chao, Co-Founder & CTO at Abbey Labs. Jeff returns for his third appearance on the show and talks about his journey to Abbey Labs, tackling identity, permissions, and passwords in an organization, challenges in data availability, identity resolutions, and more.

Notes:

Highlights from this week’s conversation include:

Jeff’s background at Netflix and Stripe leading him to Abbey Labs (2:22)
What Abbey is solving in the space (5:16)
Tackling permissions in an organization (7:30)
Opportunities to improve the availability of data (10:14)
The challenge of tackling a new problem area at a new company (14:59)
What are the most common challenges in the identity and security space (18:43)
Importance of identity and the ability to track it in data (22:46)
Connecting all the different platforms without frustrating the user (30:32)
What are the parts of access data that need to be tracked (36:10)
Dealing with the varieties of data in security and managing permissions (40:26)
Final thoughts and takeaways (51:52)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 00:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.
Welcome to The Data Stack Show. Kostas, I think this may be our first three-time guest on the show. Jeff, we first talked with him when he was at Netflix, we talked with him again when he was at Stripe, and he has now co-founded his own company, Abbey. What an amazing guy, we love having him on the show. We’re going to talk with him about Abbey today, which is in the identity space, but focused on employee identities within a company, with the emphasis on security, which is really fascinating. And what I want to know, this isn’t going to surprise you, but he built all sorts of crazy streaming technologies at some of the most famous companies in the entire world, across a number of different problem areas. And that’s pretty different from what he’s building at Abbey. When there’s a change like that, I’m always interested in the story behind it and the decision to go attack a new problem. So that’s what I’m going to ask. How about you?

Kostas Pardalis 01:31
Yeah, actually, I think it’s going to be a great opportunity to understand why the industry believes that security is a data problem. I think we have the right person to help us. It’s a very common theme lately, something you hear a lot, that security is a data problem. I think for people who are outside of security, it’s hard to understand what that means, right? So here we have someone who comes from an incredible background building data infrastructure, who decided to go and start a company in security. I think we have the right person to help us understand why security is a data problem, and how he sees that as part of the vision of the company he has founded.

Eric Dodds 02:27
Yep. I’m so excited to chat with Jeff again. Let’s dig in.

Kostas Pardalis 02:30
Yeah, let’s do it.

Eric Dodds 02:32
Jeff, welcome back. You are, at this point, a multi-time repeat guest. And it’s always such a pleasure to have you on the show. So thanks for joining.

Jeff Chao 02:41
Hey, thanks for having me again. It’s good to be back.

Eric Dodds 02:44
Okay, so for the listeners who didn’t get a chance to catch the previous episodes: number one, if you’re listening and that’s you, you absolutely need to listen to the prior episodes with Jeff, both the individual ones and the panel ones. But can you just give us a brief background, and then tell us what you’re doing today? Because you started something new, which is very exciting.

Jeff Chao 03:05
Yeah, sure thing. So the last time I was on here, well, the first time, was when I was at Netflix, working on streaming data systems. That was interesting because the premise was that we wanted to be extremely cost effective when working with this data. It was specifically around the observability space, where we wanted to help keep Netflix, the service, up and running. So we worked on a system called Mantis, we open sourced that, and it handled on the order of trillions of events per day and petabytes of data per day. After that, I went to Stripe, where I led a data team. Stripe is really big on eventing systems, so I led a data team around change data capture, working with some folks on Debezium as a contributor to the Debezium Vitess connector. This change data capture system worked with financial data, and it was mid-migration before I left. At this point, it’s 100% migrated to the new system, which does about $640 billion in annual payment volume. And so I thought, hey, things are going too well, let’s go to hard mode here. So I decided to leave and start a company in the identity security space. I am now co-founder and CTO of Abbey Labs, and we’re tackling challenges around authorization.

Eric Dodds 04:23
Very cool. Okay, I want to dig into Abbey, and all the things about it. But one question for you first: I know Stripe is big on eventing. Was it always like that? Has the company always had an event-driven architecture, or do you know if that was a process that they went through?

Jeff Chao 04:44
Definitely before my time, but looking through the git commit history, the very first incarnation of the CDC pipeline was, I think, in 2014. So not everything is evented, right? Stripe is a heavy MongoDB user. And the idea is that developers want to use a tool that works well for them. So the standard model is, hey, I have a stateless web app and I write to the database. But rather than doing these distributed transactions or complicated joins, let’s have these async systems receive the individual operations out of the database through change data capture. And then from there, you can fan it out to whichever downstream consumers need to be served.
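To make the fan-out idea in this answer concrete, here is a minimal Python sketch. It assumes a hypothetical in-memory `FanOut` class and made-up event fields (`collection`, `op`, `key`, `after`); none of this comes from Stripe's actual pipeline, and a real system would read from a database change stream and deliver through a durable log rather than direct function calls.

```python
from collections import defaultdict


class FanOut:
    """Delivers each change event to every handler subscribed to its collection."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, collection, handler):
        self._subscribers[collection].append(handler)

    def publish(self, event):
        # .get() avoids creating empty entries for collections nobody subscribed to.
        handlers = self._subscribers.get(event["collection"], [])
        for handler in handlers:
            handler(event)
        return len(handlers)


# Two independent downstream consumers of the same change stream (both hypothetical).
search_index = {}
audit_log = []


def update_search_index(event):
    if event["op"] == "delete":
        search_index.pop(event["key"], None)
    else:
        search_index[event["key"]] = event["after"]


def append_audit(event):
    audit_log.append((event["op"], event["key"]))


fan_out = FanOut()
fan_out.subscribe("charges", update_search_index)
fan_out.subscribe("charges", append_audit)

delivered = fan_out.publish(
    {"collection": "charges", "op": "insert", "key": "ch_1", "after": {"amount": 500}}
)
fan_out.publish({"collection": "charges", "op": "delete", "key": "ch_1", "after": None})
```

The web app only writes to the database; each consumer builds its own view from the individual operations, which is what lets the application skip distributed transactions and cross-system joins.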

Eric Dodds 05:33
Yeah, very cool. No, that’s just a fun bit of history there. Okay, so Abbey, give us a breakdown. What I’d love to hear is a brief explanation of what the product does. But then also, I’d love for you to go back. Where did the idea come from? And how did you decide to start a company specifically focused on this problem?

Jeff Chao 06:00
Yeah, so a couple of questions there. Definitely, we’re in early days, so things are subject to change, for sure, as you all know. But it turns out that as an organization gets larger, it’s probably a pretty good idea to have an understanding of who has access to what, so you can improve security posture, try to enforce least privilege, and all those other, you know, buzzwords. The idea is, as the number of employees in an organization grows, you’ll also see many more services, and each of those services requires different permissions at different levels of granularity. So you end up with this sort of n by m problem that makes it difficult to manage and understand the state of access within your company. Quite simply put, “who has access to what” is a difficult problem to solve. And when you can answer that question in this environment that’s fragmented and ever changing, then you can do other things for your security or your compliance programs. The way my co-founder and I have been thinking about it is that authentication is pretty mature, right? You’ve got a lot of players there. But the authorization space is still in early, early days. And there’s a certain maturity curve that a company has to go through: you want to get some single sign-on, you want to enforce passwords, and then eventually you’re going to get to this permissions-level concern within your organization. So we thought this would be a good place to help people tackle that, because the challenge really comes at scale. And the problem is that for the teams that are responsible or accountable for ensuring that this stuff works, their headcount stays relatively flat.

Eric Dodds 07:50
Super interesting. And when we were catching up before the show, you had an interesting approach to this problem. You described it as fundamentally a data problem, which I think is really fascinating. Can you break that down for us? First of all, when people do not classify it as a data problem, how do they classify it? And then why do you classify it as a data problem?

Jeff Chao 08:16
Yeah, it just comes from experience with all the different systems I’ve worked on. And you know, this is one of my hot takes, as a data person jumping into this security realm, right? But the idea is you have these different types of datasets. You have identity data, which is human and machine identity, so those are attributes on who you are, what you are. Then you have access data, which is what are the things you can access. And then you have activity data, like, did you actually do something or access something. And depending on the size of the organization, this could get pretty crazy, for two reasons. One is the scale of data. In this world, you kind of want to have a view of everything, so sampling is a tricky situation there. The other thing is, the data itself is fragmented. If you think of external SaaS applications, there can be many. Even as a startup, we already have so many: there’s your accounting software, your business software, your engineering tools, HR, etc. Then you have internal services. And then you have ephemeral things like workloads. So it gets pretty unmanageable or untenable very quickly. And so I thought, okay, all of these things are data sources. They’re generally raw data; some are log-looking data, some are more structured. I want to derive insights on this, and then from those insights, I want to do some sort of automation. So a lot of this is getting data stored somewhere in the right place, enriching the data, and doing something with those enrichments. Hey, sounds pretty familiar over here. Very cool. Yeah.
So yeah, the corollary is, if I think of these as verticalized slices of applications, like approval requests and approval workflows, or access reviews, or some sort of threat detection and response, our belief is that we can eventually build each of these use cases, given a sound foundation that is built upon best practices that we’ve learned from the data space, and a little bit also from the observability space, depending on the use case.
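The three datasets described in this answer (identity, access, activity) can be sketched as a tiny data model. This is only an illustration under stated assumptions: the record types `Identity`, `Grant`, and `Activity`, their fields, and the sample rows are all hypothetical, not Abbey's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identity:
    """Identity data: who or what you are."""
    id: str
    kind: str  # "human" or "machine"


@dataclass(frozen=True)
class Grant:
    """Access data: what an identity is allowed to touch, and at what level."""
    identity_id: str
    resource: str
    level: str


@dataclass(frozen=True)
class Activity:
    """Activity data: what an identity actually touched."""
    identity_id: str
    resource: str


identities = [Identity("jeff", "human"), Identity("ci-bot", "machine")]
grants = [
    Grant("jeff", "billing-db", "admin"),
    Grant("jeff", "wiki", "read"),
    Grant("ci-bot", "artifact-store", "write"),
]
activity = [Activity("jeff", "wiki"), Activity("ci-bot", "artifact-store")]


def access_for(identity_id):
    """Answers 'who has access to what' for a single identity."""
    return sorted((g.resource, g.level) for g in grants if g.identity_id == identity_id)


def unused_grants():
    """Grants with no matching activity: candidates for a least-privilege review."""
    used = {(a.identity_id, a.resource) for a in activity}
    return [g for g in grants if (g.identity_id, g.resource) not in used]
```

Joining access data against activity data is one simple example of the "derive insights, then automate" step: a grant that is never exercised is a natural input to an access review.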

Eric Dodds 10:33
Yeah, for sure. And I’m interested to know, as you look at the landscape of that data being created, coming from the data space, do you see opportunities around improving the availability of data there? Because in the world of data, if you talk about CDC or eventing systems, which we just discussed, those are very established concepts: decades old, lots of technology, lots of established patterns. And I think a lot of times when you bring a paradigm of, okay, this is actually a data problem at the core, into a discipline that heretofore hasn’t really been described as a data problem, there can be deficiencies on the actual data side of things. Is that an opportunity area, or a challenge that you see?

Jeff Chao 11:31
Yeah, definitely. Also, definitely learn from what came before. There’s always going to be nuances, right? You can’t just say, “Oh, let’s sprinkle on some software and call it a day.” So this is like, don’t just say, “Oh, let’s sprinkle on some data technologies and call it a day.” No, it’s not like that. But yeah, I think the challenge is, at least for the companies that we’re thinking of, they might be cloud native, or they might not be. They might be bare-metal, old-school on-prem, or on-prem with their own cloud account. So depending on your prioritization, you want to consider each of those differently. And what I mean is, I mentioned the word fragmentation earlier, so the ecosystem is fragmented. You have API calls. If someone’s more sophisticated, then yeah, sure, connect to the stream. Otherwise, it’d be more snapshot-based, or there are different protocols. It’s a bit tricky. So integration is a huge pain point. There are a lot of players out there, and I would like not to build yet another system that ingests data. So yeah, I would like to avoid that. But in my opinion it’s relatively easy to ingest the data. The problem is, okay, I can get the data in, I can set it up in five minutes, an hour, or whatever. But then I’m going to have the next-level questions immediately after that, which are, okay, how do I do it without blowing my budget? Oh, you’re going to give me a full refresh on every single ingest? I don’t think so. How can I do this incrementally? Or the other thing is, okay, so the data is in. How good is it? What about data quality? Because I want this to be actually correct. So ingestion, getting the data in, is just the first, initial problem. And I think it’s still early there. There are a lot of players there, but I’m eagerly waiting, because I do not want to do this again.
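The full-refresh versus incremental question raised here can be illustrated with a small cursor-based sketch. It assumes each source row carries a monotonically increasing `updated_at` value; the function names and rows are hypothetical, and real sources often expose change tokens or streams instead.

```python
def full_refresh(source_rows):
    """Re-reads everything on every run: simple, but cost grows with total data size."""
    return list(source_rows)


def incremental_sync(source_rows, cursor):
    """Reads only rows changed since the last cursor and returns the advanced cursor."""
    new_rows = [row for row in source_rows if row["updated_at"] > cursor]
    new_cursor = max((row["updated_at"] for row in new_rows), default=cursor)
    return new_rows, new_cursor


rows = [
    {"id": "u1", "updated_at": 1},
    {"id": "u2", "updated_at": 2},
    {"id": "u3", "updated_at": 3},
]

# First run starts from cursor 0, so it behaves like a full load.
first_batch, cursor = incremental_sync(rows, cursor=0)

# One new row appears upstream; the next run only pays for that row.
rows.append({"id": "u4", "updated_at": 4})
second_batch, cursor = incremental_sync(rows, cursor)

# Nothing changed, so the third run moves no data.
third_batch, cursor = incremental_sync(rows, cursor)
```

One known limitation of this style: a row that is hard-deleted upstream never shows up with a newer `updated_at`, so deletions need a separate mechanism such as periodic reconciliation or a change stream.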

Eric Dodds 13:29
Yeah. Okay, I know Kostas has so many questions, but I do have a question about the name of the company. Abbey, I mean, I love the term. You think about, you know, almost like a monastery, or something of that nature. Give us the thinking behind the name. It’s such a unique name, especially for this type of company. We’re talking about identity, or security breaches, or other things like that, and it’s not what you would necessarily think of.

Jeff Chao 14:01
Yeah, well, a lot of props to my co-founder on that one. But the idea is, we believe in bringing peace of mind to companies, especially in this crazy world where authorization or permissions are getting out of whack. And we believe in doing that without necessarily having to be so masculine about it. Abbey really just came from the idea that it’s a place where you can congregate and be at peace. So our thinking is that we can congregate, or gather, this data, make it available to people in the way they want it, and give them the control. And then they can eventually have peace of mind to build out their security and compliance programs. Love it. Kostas?

Kostas Pardalis 14:43
Yes, I have many questions. So yeah, love it. I also would like to tease Eric a bit, because it’s a pretty common quote from him. I get it as a signal that my time’s coming, you know, when he says, “I know that Kostas has a lot of questions.”

Eric Dodds 15:04
That’s your cue. Yeah, that’s our secret. That’s our secret signal.

Kostas Pardalis 15:09
Yeah, so okay. Before we get into more technology-related questions, I want to ask you something a little bit more personal. You are a person who has had a career so far as an engineer around some specific things, right? Like data, data infrastructure, events in many different forms. And at some point, you decide to go into a new space, right? And yeah, sure, these are data problems, but it’s not only a data problem that you’re solving here. So what’s your experience with that, as an engineer, going from something that you feel comfortable with, where you have done a lot of things and have your confidence, and getting into a new problem area?

Jeff Chao 16:05
Yeah, that’s a great question. I will say that, even as an engineer, I love the technical side, I always tinker. But for me, it’s about solving a larger problem, a problem for someone, and one that matters to them. So I’ve always been interested in customer empathy. Even in data infrastructure, I’ve always pushed for being a full cycle developer, where you really own the thing you’re doing end to end. Part of that is understanding that you’re building something with not just the technical thing in mind: you tie the problem to the product to the technical. Even in data infrastructure, it’s like, who are your customers? Other engineers, machine learning engineers, etc., right? So I’ve always been interested in that. So part of it is customer empathy. The other is product building. And lastly, I do have a lot of things that I’ve learned over the years in terms of company building and building great teams, and I’d like to put that to the test and see how it goes.

Kostas Pardalis 17:15
Yeah, makes total sense. And I think what you’re saying is also a partial response to my next question, which has to do with the experience of going from being employed in a big company to starting your own company, right? Because obviously, it’s a different experience. You personally answered part of that. But tell me a little bit about this experience so far, how it feels going from being part of these huge organizations to being you, your co-founder, and I don’t know how many engineers you have right now, but still a much smaller environment compared to what it was before.

Jeff Chao 17:57
Yeah, I will say that, as a founder, it takes a certain type of person to do that. But overall, I would say whether you’re a founder or not, the fulfillment is a lot higher, because there’s so much accountability. So if you thrive on that accountability, that execution, then this place or any other startup is really the way to go. And a lot of it is going broad. If you were trying to go deeper, at least for engineering, I would recommend going to a larger company. You’ll get to see all the patterns, good or bad. Eventually, you’d have to pick and choose pieces of that and distill them down into what could be useful to a startup. But fulfillment is the real big winner here. That’s not to say that it’s without pain. It’s tons of pain, but also very fulfilling.

Kostas Pardalis 18:47
I hear what you’re saying. Yeah, but let’s not focus on the pain today. Let’s keep that for another episode. Actually, we’ll do that after your IPO, okay? Then you’ll have another show to talk about the pain. Okay, let’s talk a little bit more about technology now. We’re talking about security, and security is a broad thing, right? When someone is reading your landing page, for example, you are talking about identity. Why don’t you give us a little bit of an overview of what security is, the parts that we most commonly see out there, and how identity fits into that?

Jeff Chao 19:32
Yeah. What is security? Oh, boy, that’s a lot. I could make a joke for sure. But security, I guess, for me is tying the business value to the risk. Different companies have different risk tolerances, and that doesn’t mean that they’re less or more secure, right? It’s just tied to the risk model that they have, and that’s tied to the business value that they want to preserve or generate, etc. And so around identity, in this cloudy cloud environment, you have multi-cloud, you have hybrid cloud, there’s on-prem as well. That’s what I mean by hybrid. But the days of being within this single, walled network are no longer a thing, and haven’t been a thing for a while. And especially with the past couple of years, where you have employees who are not necessarily within the confines of an office and a VPN in a single location, they’re free to go anywhere as well, it really becomes about identity, right? Inadvertent access, or intentional or malicious access, is done by a person, or a thing which is backed by a person. So it all boils down to an identity. And there’s already a lot to that, so we’re just thinking about the employee identity for now. With identity, there’s employee identity, or human; there’s service identity; and there’s workload, or machine. We’re thinking of that in the confines of a company for now. What it means is, imagine there’s a breach or something. It’s like, okay, what is the impact of that breach? Okay, maybe an account got taken over. What access levels does this account have, and to which resources? And how can we begin to figure that out, try to traverse that tree recursively, and then maybe do some communication or some mitigation, etc.?
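The "traverse that tree recursively" step for a compromised account can be sketched as a reachability walk over an access graph. The graph below, its `user:`/`group:`/`role:`/`resource:` naming scheme, and the `blast_radius` function are assumptions made for illustration only.

```python
# Edges point from an identity, group, or role to whatever it grants.
access_graph = {
    "user:jeff": ["group:data-eng", "role:wiki-reader"],
    "group:data-eng": ["role:warehouse-admin", "group:engineering"],
    "group:engineering": ["role:repo-writer"],
    "role:warehouse-admin": ["resource:warehouse"],
    "role:repo-writer": ["resource:git-repos"],
    "role:wiki-reader": ["resource:wiki"],
}


def blast_radius(node, graph, seen=None):
    """Recursively collects every resource reachable from a compromised node."""
    if seen is None:
        seen = set()
    reachable = set()
    for child in graph.get(node, []):
        if child in seen:
            continue  # guards against cycles such as nested groups that reference each other
        seen.add(child)
        if child.startswith("resource:"):
            reachable.add(child)
        reachable |= blast_radius(child, graph, seen)
    return reachable


impacted = blast_radius("user:jeff", access_graph)
```

In practice the structure is a graph rather than a strict tree, since groups and roles are shared across people, which is why the sketch tracks visited nodes.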

Kostas Pardalis 21:38
Right. And how is identity established in, let’s say, the most traditional approach in the industry right now?

Jeff Chao 21:49
Yeah, I’d say by far, there’s a maturity curve, for sure. Identity is established through, I would say, your Google Workspace. You know, everyone has a Gmail account for their company or something, right? Or maybe Microsoft or something like that, if they’re a Microsoft shop. After that, they might do some simple things around authentication, like, okay, let’s make sure there’s a password rule: it must be this long with this number of characters, and it might have to be refreshed every quarter or something. Then you go up the maturity curve, and it might be, okay, let’s SSO everything. Then more people join, there are more applications, they might have contractors, people are changing roles. Okay, we need a single sign-on, an identity provider. Maybe try to do something with Google, or maybe move to Okta or some other provider. And then after that, it’s like, okay, well, everyone has admin access to everything, so now we need to lock that down for different compliance reasons. That’s the stick. The carrot would be, okay, we actually want to improve our security posture or reduce cost in managing this kind of stuff, and we can actually have our employees be more productive and have a better experience.

Kostas Pardalis 23:09
All right, so let’s say, in my organization, I’m using Okta, right? So I have a central repository of identity, I would say. Everyone needs to go to that to identify themselves. And there is something in the system that represents my identity, right? I’m saying “something” just because I want to hear from you what that something actually is, because in this way we can get into the data side of things that have to travel around the different applications and systems that I’ll be interacting with. How does this work? And how important is it to trace that, and when is it important to trace that? Because if you think from the user perspective, the employee perspective, for me it’s just something that I have to go through because I’m forced to do it. I need to access 10 different tools. I’ll go to Okta, someone will add my applications there, I’ll click on them, and suddenly I have access. I go to Salesforce, and I can do my job there. But I don’t really know what’s happening between the systems, right? I have an idea of why it is important to do that, but what is tracked, how it is exposed, and who cares about that is not something that I know, and for a good reason. That’s not my job, right? Can you take us through the journey of the data there? This identity, how is it represented, how does it move from one system to the other, what kind of traces does it leave behind? And from all that information, what do we need in order to do other things later on?

Jeff Chao 24:43
Yeah, yeah, that’s a great question. So there are two cases. One is where a company is a bit more mature, and they have everything pretty locked down, going through an identity provider already. And then the other case is where they don’t. Where they don’t, there is probably zero visibility into who has access to what. In the case where they do have things locked down through an identity provider, and assuming it’s all integrated and everything, then they can do some level of, let’s say, tree traversal, if you will, starting from that root, which is basically a GUID, and then traversing down to what access they have. The only thing there is, it’s not granular access. It’s based on groups, or however they defined groups or roles. So that’s just a limitation there. But then the problem is, it’s not as centralized as it used to be. For example, if someone in marketing decides to add a new marketing tool, they can, with their corporate card, right? And now they have access to this new thing that might not be in the view of the team, a security team or IT team, that’s responsible for that. Same thing for engineering. How many times in a large company have you been using a very big bug tracking product, and then you’re like, “Hey, let’s go use this Trello thing,” or something like that? That happens all the time. So yeah, even then it can still get out of hand. So there’s the access to resources, but then there’s also the levels of access too. So there’s quite a bit of work that goes into that. Then the thing is, sure, you can have a team, like your security or your IT team, build this stuff.
And it’s relatively easy, right? But then the problem is, okay, what do you do on day two? How about maintaining this thing? Who’s on call for this? And all that stuff. Do you really want to do that? Because that’s out of your core competency. You want to be furthering other parts of your security or compliance programs, not doing this sort of data engineering work, right? And so, to go back to the other question, why does this matter, or when does this matter? There are two parts to it, if you use the carrot and the stick analogy. A lot of it is compliance driven, quite frankly. There’s SOC 2, there’s ISO, there’s SOX, and many other types. These are just rules, or controls, that you have to abide by, for whatever reason is deemed necessary by your company, right? So that’s the first thing. The class of solutions that come out of that, that are born to solve those, would be access reviews, or compliance report generation, or even request approval flows. But even after that, there can still be different levels of manual work, so you want to automate that as much as you can. Because, as you said, ICs, people down the line, might not have the context to work with this type of thing. Imagine, you know, I’m a manager. It’s the end of the quarter, quarterly planning is coming up, I have to attend a QBR, there are other things going on. Meanwhile, Slack yells at me with 60 permissions that I have to review and approve by the end of the day. What do you think I’m going to do? I’m just going to hit yes, sadly. That might get me through the compliance part, but it doesn’t necessarily get you through the security part. And so at the end of the day it becomes worrisome, because then there are liabilities there, right?
There could be fines, or violations, etc., because the review could be inaccurate, or you could actually end up getting breached or something like that. So there’s pre-breach and post-breach, I would say. Pre-breach, it’s all of the posture, the compliance, companies trying to be least privilege or zero trust, and that’s all cool, but it’s really just making security better. And then post-breach is understanding the impact, or the blast radius. So an account got compromised: what are all the things this account has access to, at what levels of access, and how do I go in and shut things off? The answer is I don’t want to do any of that. I want a system that automatically does that for me and then tells me after, or, depending on the risk tolerance of the company, it can have me approve it or not. But that’s the idea.

Kostas Pardalis 29:42
Okay, and where does Abbey operate in this picture that you have described?

Jeff Chao 29:45
Yeah, so right now, we’re thinking about this in a few ways. We’re thinking about it in terms of integrating. The ecosystem of data sources is fragmented, so there’s the integration, and we’re trying to solve that as well. But then in addition to that, you have this raw data. So we’re trying to build out, let’s say, a unified view of an identity. In other domains, this is called entity resolution. We built out a little thing where you can see a graph starting from Jeff, and look at all the levels of access that I have, and to which resources. And then there’s a little search, so I can search for different resources, and it’ll highlight parts of the graph. So: integration, identity resolution, and then the last part is automation. You have this foundation of data, you can integrate, you can enrich it, which is the identity normalization, or resolution. And then after that, you take that data and you automate against some workflows. That would be around things like access reviews, or request approvals.

Kostas Pardalis 30:52
And okay, so we have the identity, and for each system that this identity has access to, most probably the system has its own access controls, right? Like Salesforce has its own, Zendesk has its own.

Jeff Chao 31:12
Well, yeah, everybody’s different. It’s crazy.

Kostas Pardalis 31:15
Exactly. Yeah. And then, of course, you have everything in house where, who knows what’s going on, or you have systems that can become super complex in terms of how access controls are managed. How do you connect and align all these things without just creating noise in front of the user? Because one thing is to aggregate all the data, and it’s a completely different problem to make sense out of all this data, right? So how do you do that? Give us a little bit of insight, because that’s an interesting data problem.

Jeff Chao 31:52
Oh yeah, man, this is the funny thing. Because, you know, one could say, we’ll create a standard and then everyone follows it, but then you just end up with n plus one standards, right? So we’ll see how that goes. But there are existing protocols out there, and standards, and people that are trying to do good work on that. For me, this is drawing from the data space, right? Okay, let’s speak around a concrete example. I want to understand who Kostas is. Kostas is a GUID in Okta, Kostas is an email address in Google, and Kostas is, let’s say, an IAM policy in AWS, or Kostas is a mapping in a YAML file on a service. So how do I understand what that is? There are three ways to do that. One is, you can do a direct mapping, if it’s that easy, like email address to email address, exact match. The second way is using heuristic or rules-based matching. So let’s say we have GitHub as well. GitHub usernames are usually personal accounts, right? If you had a GitHub account that was prefixed with the company name, hyphen, username, you can apply that heuristic or that rule, and the same for other identity sources. The third one is where both of those fail. If there are zero attributes that you can look at to map them together, then that comes down to inference. Inference is how you infer who someone is, and you do that through their behavior, the things that they have access to, how similar they are to their peers. And so now we’re getting into a lot of classification, or some sort of graph clustering. So those are the three ways that I see today without a standard.
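The three matching strategies described in this answer (exact match, rules-based heuristic, inference) can be sketched as a fallback chain. Everything here is hypothetical: the `directory` contents, the `examplecorp-` handle prefix, and the threshold are invented for illustration. For the inference step, plain Jaccard similarity over accessed resources stands in for the classification or graph clustering mentioned in the conversation.

```python
# Known people, keyed by corporate email (the canonical identity in this sketch).
directory = {
    "kostas@example.com": {"username": "kostas", "resources": {"warehouse", "wiki", "git-repos"}},
    "eric@example.com": {"username": "eric", "resources": {"crm", "wiki"}},
}
COMPANY_PREFIX = "examplecorp-"  # assumed naming convention for external handles


def match_exact(account):
    """Strategy 1: direct mapping, email address to email address."""
    email = account.get("email")
    return email if email in directory else None


def match_heuristic(account):
    """Strategy 2: rule-based match on a '<company>-<username>' handle."""
    handle = account.get("handle", "")
    if not handle.startswith(COMPANY_PREFIX):
        return None
    username = handle[len(COMPANY_PREFIX):]
    for email, person in directory.items():
        if person["username"] == username:
            return email
    return None


def match_inferred(account, threshold=0.5):
    """Strategy 3: infer identity from behavior via Jaccard similarity of resources."""
    seen = account.get("resources", set())
    best_email, best_score = None, 0.0
    for email, person in directory.items():
        union = seen | person["resources"]
        score = len(seen & person["resources"]) / len(union) if union else 0.0
        if score > best_score:
            best_email, best_score = email, score
    return best_email if best_score >= threshold else None


def resolve(account):
    """Tries the cheapest, most certain strategy first and falls back in order."""
    strategies = (("exact", match_exact), ("heuristic", match_heuristic), ("inferred", match_inferred))
    for method, matcher in strategies:
        email = matcher(account)
        if email is not None:
            return email, method
    return None, "unresolved"
```

Returning the method alongside the match matters in a security setting: an inferred link is a guess and would reasonably be reviewed by a person before it drives any automated permission change.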

Kostas Pardalis 33:47
Yeah, that’s super interesting. Because, okay, one thing is mapping on a syntactic level, which is hard in itself, right? When you have an email, and then you have the YAML file and an XML document, I don’t know. With that, I can have fun. But there’s also the semantic level, right? What’s the meaning behind these things, and how aligned are they across the company? And I bring this up because, when it comes to access control, my experience is mainly with data: you have role-based access controls, and then you have attribute-based access controls. In the end, they’re supposed to be doing the same thing, but in a different way. But how do you transform one into the other? It’s not that trivial, right? Even if they represent exactly the same things, because of the way we represent things, or what we implicitly mean by them, it’s not easy. So I find it super interesting. And by the way, it’s not only in security. I think Eric can talk about identity resolution in marketing, right? It’s about who is doing what, and how to create these identity graphs there. So you mentioned some specifications and protocols. Can you tell us a little bit more about that? What are the standards out there, if there are any?

Jeff Chao 35:15
Yeah. So there are a couple of things that I’d like to address. One is Open Policy Agent, and specifically the Rego language. That’s for defining policies. We’re thinking of using it so that we can have some standard around defining policies in a sensible way, and then evaluating them as well. And then, on the API side, there’s SCIM, the SCIM protocol. That’s mostly about detecting changes upstream, listening to them, and then applying permissions changes around users downstream. There’s also a read component to that, since it’s just CRUD on REST. There are also a number of open source, or probably I would say source-available, projects out there that are attempting to have a standard around ingesting these types of sources, these types being any external system, any SaaS application really, and then having some sort of interface or API around that. So yeah, I would say those are the main ones.
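
To make the SCIM part concrete, here is a small sketch of applying an upstream group-membership change downstream. It assumes the `PatchOp` message shape from SCIM 2.0 (RFC 7644) and handles only `add` and `remove` on `members`; the function name `apply_group_patch` and the in-memory set of member IDs are illustrative assumptions, not part of any real SCIM library.

```python
import re

PATCH_OP_SCHEMA = "urn:ietf:params:scim:api:messages:2.0:PatchOp"

# Matches a remove path such as: members[value eq "u1"]
_MEMBER_FILTER = re.compile(r'^members\[value eq "([^"]+)"\]$')


def apply_group_patch(members, patch):
    """Apply a simplified SCIM PatchOp to a set of member IDs; return a new set."""
    if PATCH_OP_SCHEMA not in patch.get("schemas", []):
        raise ValueError("not a SCIM PatchOp message")

    result = set(members)
    for op in patch.get("Operations", []):
        kind = op.get("op", "").lower()   # some providers send "Add"/"Remove"
        path = op.get("path", "")

        if kind == "add" and path == "members":
            # value is a list of {"value": "<member id>"} objects
            result.update(item["value"] for item in op.get("value", []))
        elif kind == "remove" and path == "members":
            # no filter: the whole multi-valued attribute is removed
            result.clear()
        elif kind == "remove" and _MEMBER_FILTER.match(path):
            result.discard(_MEMBER_FILTER.match(path).group(1))
        else:
            raise ValueError(f"unsupported operation: {kind!r} on {path!r}")
    return result
```

A real integration would also need `replace`, the wider filter grammar, and error responses, so this only shows the shape of the change events that flow from the identity provider to downstream systems.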

Kostas Pardalis 36:30
Okay. So these are the policies, right? That’s, let’s say, the formal part: how we define things and how they should ideally be. But you also have to start tracking what’s going on with these systems. So I guess there you have different types of data that you need to collect, probably logs, or I don’t know. So what’s there? What does the behavioral part of the identity that you’re tracking look like, and how do you collect that?

Jeff Chao 37:00
Yeah, yeah. So there are three types of data: identity, access, and activity. Identity, again, is human and machine, and that can come from any identity provider source. That’s generally a REST API, hopefully snapshot-based. Access data would come from those same sources as well, but access data might also come from a resource itself, because any OLAP or OLTP database might have permissions embedded in it, right? So you could get it from there. And then activity data, that’s just a fancy word for logs. In the security space, there are SIEMs, S-I-E-M, which collect everything, or there are other flavors of SIEM, like XDR and EDR, extended detection and response, etc. Basically, those are Elasticsearch-looking things. So it’s the same patterns, right? You’re ingesting from a REST API, where a schema is just a schema, and it might have different schemas or envelopes. You’re ingesting directly from data stores or data sources, like an OLAP database or an event queue or something like that. And then you’re also ingesting logs or search indices.
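
As a rough sketch of what those three record types and the "different envelopes" point could look like in code: the field names, the list of envelope keys, and the helper names below are assumptions made for illustration, not a description of any particular product’s schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identity:
    source: str       # which identity provider the record came from
    subject: str      # the provider's own ID for the human or machine
    kind: str         # "human" or "machine"


@dataclass(frozen=True)
class AccessGrant:
    subject: str
    resource: str     # e.g. a database, table, or SaaS workspace
    permission: str   # e.g. "read", "admin"


@dataclass(frozen=True)
class ActivityEvent:
    subject: str
    resource: str
    action: str
    timestamp: str    # kept as the ISO 8601 string found in the log


# Different REST APIs wrap their record lists under different keys.
ENVELOPE_KEYS = ("Resources", "data", "items", "results")


def unwrap(payload):
    """Return the list of records inside a response, whatever its envelope."""
    if isinstance(payload, list):
        return payload
    if isinstance(payload, dict):
        for key in ENVELOPE_KEYS:
            if isinstance(payload.get(key), list):
                return payload[key]
    raise ValueError("unrecognised response envelope")


def to_identities(source, payload, id_field, kind="human"):
    """Normalise one provider's snapshot into common Identity records."""
    return [Identity(source, str(rec[id_field]), kind) for rec in unwrap(payload)]
```

Keeping the envelope handling separate from the record mapping means a new source usually needs only a new key or field name rather than a new ingestion path.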

Kostas Pardalis 38:24
Yeah, well, it sounds like a lot of data. Is it a lot of data, probably?

Jeff Chao 38:31
I would say it depends on the size of the company. People use a lot of role-based access, so hopefully the number of groups isn’t too large. But we’ve seen them be pretty large at our customers. Like, there are twice as many admin roles or groups as employees, and that’s not a good idea. So in terms of the number of items, it’s not that much. But if you want to listen for changes on those, that could be a bit more: the identities and access change a bit more frequently than the sheer number of them would suggest. But then when you add in the activity data, that is the long tail.

Kostas Pardalis 39:19
Yeah, that’s what triggered this question for me, because we’re talking about logs, and logs can be verbose, right? There’s a lot of data there. And there’s also a bit of processing that needs to happen, because they are semi-structured data, typically something like JSON.

Jeff Chao 39:38
Right. But logs are so funny. It’s like, how do I say it? It’s barely valuable, because it’s raw, and the logs might be holding a lot of sources of data that may look different, and yet it’s still so valuable at the same time, if you’re able to structure them and extract the right insight that you need. Because it’s kind of like, you don’t know what you don’t know, you know? So in security, you kind of want to know as much as you can, obviously depending on your risk tolerance, but yeah.

Kostas Pardalis 40:21
So okay, from all this different data, what kind of modeling do you do on top of that? Because somehow you need to connect all these things. You have different serializations, some of them more low-level, and they’re just so different, right? You have, as we said, semi-structured logs, and then you have identities that are records in a database, right? It’s the opposite. So how do you deal with that variety of data that needs to be homogenized somehow?

Jeff Chao 40:58
Yeah, in a pretty standard way. You ingest the raw data, you TTL it if you need to, and then you have async systems that are able to process and reprocess the data to normalize it and produce some sensible representation. One of the datasets we spit out is the resolved identity, and that’s just a single dataset. And then that’s stored somewhere, and that’s it. It’s pretty standard here. In terms of serialization, on the ingest side a lot of services come through REST, and there might be different envelopes on them, and we’re able to handle that. On the egress side, we haven’t done it yet, but the whole idea is to not build a walled garden. We want to give control to our customers, so you can bring your own tooling: bring your own database, bring your own BI tool. The reason is that this data should be accessible not only by security engineers, but also by IT admins, or maybe data engineers with a security focus. So why would we want to build a tool that you aren’t using today? There’s already amazing tooling out there. So we want to use specific table formats and the query engines that are available, and you can just plug them in. We’ll host the data for you. If you don’t like that, we can do things like bring your own encryption keys, or you can host the data yourself, if you dare. So yeah, interoperability is pretty huge for us.

Kostas Pardalis 42:52
Makes sense. Can you give us an example of the first insights, let’s say, that someone can get from this homogenized and processed dataset that you create? It would be nice if it’s something that someone working in that space would find hard to get before using something like Abbey.

Jeff Chao 43:18
Yeah, some simple questions. Once you’re connected, it’s just: how many admins do I have, and to which systems in my company? That’s the first question. The more interesting question on top of that is transitive access: how did Kostas get access to this RDS instance, or this table within an RDS instance? He got access because he’s part of this group, which is part of a group before that, and Eric had added Kostas to that group. That’s how he has access. And the third thing is that, aside from analytics, we use the same thing to run continuous queries, so you can basically throw an alert. I know how many admins I have now, so alert me on Slack if that goes beyond that number. We have 10 today, hopefully no more than 10, so alert me on that. That’s the beginning of building automation.
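
The transitive-access question and the admin-count alert can both be sketched in a few lines. The data shapes here (a `member_of` mapping for nested groups, a `grants` mapping from group to resources) and the function names are assumptions for illustration; a real system would run the equivalent query against its own store and post the message to a chat tool rather than return a string.

```python
from collections import deque


def access_path(member_of, grants, principal, resource):
    """Breadth-first search up through nested groups.

    Returns the chain from the principal to the resource, for example
    [user, group, parent_group, resource], or None if there is no access.
    """
    queue = deque([[principal]])
    seen = {principal}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if resource in grants.get(node, ()):
            return path + [resource]
        for group in member_of.get(node, ()):
            if group not in seen:          # guard against cyclic group nesting
                seen.add(group)
                queue.append(path + [group])
    return None


def admin_alert(admins, limit):
    """Continuous-query style check: a message if the admin count exceeds limit."""
    count = len(set(admins))
    if count > limit:
        return f"admin count {count} exceeds limit {limit}"
    return None
```

Because the search is breadth-first, the path returned is the shortest chain of group memberships, which is usually the easiest one to explain in an audit.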

Kostas Pardalis 44:23
And one last question from me, and then I’ll give the microphone back to Eric. From your experience so far with the customers and users you’re talking with, what are the first and most obvious systems, let’s say, that they bring in and try to get insights from? Because, from what I understand, when we’re talking about identity, it touches everything, right? It can be a SaaS application, it can be your cloud infrastructure, it can be your database systems, it can be pretty much everything. So what are the most common and the first use cases you see? In terms of infrastructure, where are they struggling today to have good monitoring of identity?

Jeff Chao 45:08
Yeah, I’ll frame this in terms of user persona. The first one is: I’m a head of security, or someone responsible for it, and I just joined the company. WTF, what’s going on? I need some insight into who has access to what. That’s number one. Number two is: we have an audit coming up, and I need to understand who has access to what so I can do any remediations. And number three is: oh no, we’ve been breached. That one’s a bit worse because it’s more time-bound, but I want to understand what the blast radius is. So really, number one is understanding the state of access. But ultimately, that honestly matters a lot less compared to actually doing the thing that comes after.

Kostas Pardalis 45:57
Makes sense. All right, Eric, it’s all yours. I’m sure you also have more questions.

Eric Dodds 46:03
Hey, well, this is so interesting. Kostas read my mind here, which makes sense, because we’ve been doing this for a couple of years. Of course, I come from the world of marketing, where we talk a lot about identity resolution. Going into this conversation, part of me thought, okay, an organization is, in some ways, a closed system, right? When I think about marketing, there are all these external touchpoints that I have zero visibility into, and in many ways I can only understand them via proxies, through the way people come into an interaction with my company and then go through the journey and all that. But if you think about the inside of a company, even though there can be a lot of ambiguity, at least you have somewhat of a closed system. But the more you talked about it, the more I thought, you could really just change out some terminology and be talking about identity resolution in general. Do you agree with that? And are there things to learn from customer identity resolution in the way that you solve that problem inside of an organization?

Jeff Chao 47:16
Yeah, I think the techniques will certainly be the same, though there will always be nuances. That’s where we draw inspiration from. It’s identity resolution, and a lot of people have done a lot of work that came before this, so it’s nothing new in that regard, within the context of an organization. That is true, but I would be careful about thinking of an organization as a very static, walled thing. Organizations are themselves amorphous in many ways. How many reorgs have you been through at a large company? How many contractors come in and out? How many services are built and torn down? How many employees join or leave the company? So it’s really tricky, because there’s fragmentation, and change is really the only constant, to throw that cliche out there.

Eric Dodds 48:11
Yeah, no, that is actually very interesting to think about. When I think about it from a marketing perspective, there are all sorts of entry points, and then certain pathways that you can go through, but there actually aren’t that many pass-through systems, interestingly enough. It’s much, much more complicated inside of an organization, right? Because you have an individual identity traversing hundreds of systems, whereas a customer may be in lots of systems, but their journey generally follows a fairly defined path.

Jeff Chao 48:46
Yeah, and that would even be the better case, because a lot of times there’s not even an individual identity through an identity provider. A company might have done some number of M&As in the past year, and those companies each brought their own identity provider, and yet you’re all still under the same ticker symbol. So you get a lot of that.

Eric Dodds 49:09
Yeah, it is. It just sounds so funny to think about concepts like fingerprinting your own employees. Not in a Big Brother way, but, you know, from a security standpoint. That’s just an interesting concept.

Jeff Chao 49:25
Yeah, I hear you. That crossed my mind as well. Yeah.

Eric Dodds 49:28
Okay, well, we’re at time here, but I do have one more question for you. You have built amazing technologies at some of the most interesting, awesome companies in the world, but now you’re building a new company. One thing that I like about going through that experience is that you get a chance to explore a lot of different things that maybe you were more limited in before, just because of the scope of your role or the project at a larger company. In building Abbey, have you run across any interesting technologies, new or old, that have been intriguing to you?

Jeff Chao 50:11
Yeah, you know, I’m an active follower of the streaming space. That’s my bread and butter. So I’m eagerly awaiting the developer experience getting better there. It’s funny, I listened to the Shop Talk with you and Kostas about the real-time versus streaming debate. I’ve got a lot of opinions there. But...

Eric Dodds 50:30
Let’s do a follow-up Shop Talk so I can have you on.

Jeff Chao 50:33
Yeah, that’d be great. I’m really glad that the title was explicitly real time versus streaming, because those are not necessarily mutually exclusive, right? So thanks for that. But that aside, there’s some streaming stuff, because we do have a streaming component on our side, and I don’t want to build yet another thing. I don’t have time for that. The other thing is, we’re thinking a lot about graphs, so graph data stores or graph-relational stores, and also more standardization around security metadata protocols like SCIM, I would say, and other things like that. Permission stores are also very interesting to me, and there are a number of players out there. Everything is on the control side. We’re not so much thinking about enforcement, which is a different approach with a different set of problems and technologies. So anything around control is really interesting to me right now.

Eric Dodds 51:34
Well, I definitely want to do that, so let’s make sure to get Jeff back on for a follow-on Shop Talk on streaming, because we would love a hot take. Jeff, thanks again for joining us. Abbey sounds awesome, and best of luck with it.

Jeff Chao 51:49
Yeah, thanks. Thanks again for having me. It’s good to see you all again.

Eric Dodds 51:53
Okay, what I loved about that, Kostas, was that a couple of times Jeff said, I don’t want to build another streaming service, which I love on a couple of levels. Obviously, having done that at companies like Netflix and Stripe, he’s seen a lot of angles of that problem and solved it at a scale that many of us will never see. So I just loved that it was, in some ways, a humble way to acknowledge that he has solved a lot of those problems. And how cool is it that he’s at a point where it’s not intellectually stimulating for him to continue to focus on that problem area? Man, what a place to be. That was really cool. And then also, at the very end of the show, I was just super intrigued by the parallels with identity resolution from my background in marketing, and how similar that is to the problem they’re solving inside of a company. Now, obviously, the security concerns are very different. But that is really fascinating, and I’m sure I’ll be thinking a lot about that this week.

Kostas Pardalis 53:02
Yeah, 100%. It’s always fascinating. When it comes to that, one of the greatest things about software engineering and computer science, this whole industry, is abstraction, right? And it’s very interesting to see how the same abstractions apply to different problems, and how you can implement similar techniques to solve problems in very different areas, from security to marketing, when at their core they are the same. That’s always something that I find super fascinating. It’s one of the reasons I love the things that I’m doing, why I work in this space, why I like computers, and all that stuff. And this is exactly one of those cases. Of course, the implications of the solution and the problem are very different when we’re talking about security or marketing or something else, but that’s what makes it interesting, right? You can build something. And that’s what we see with people like Jeff: you have someone who builds data infrastructure, and now he can take all these experiences and apply them to a different domain.

Eric Dodds 54:28
Yeah. That’s beautiful. Yeah, I agree. I love it. And I definitely want to get him on a Shop Talk. I think that would be awesome. Yeah, let’s do it. Absolutely. All right. Well, thank you for listening. Subscribe if you haven’t, and tell a friend. Jeff is a subscriber, so if you want to be like Jeff, subscribe to the show. And we’ll catch you on the next one. We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.
