Episode 182:

Building a Dynamic Data Infrastructure at Enterprise Scale Featuring Kevin Liu of Stripe

March 20, 2024

This week on The Data Stack Show, Eric and Kostas chat with Kevin Liu, Software Engineer at Stripe. During the episode, Kevin discusses data infrastructure challenges and the development of data products. He also shares insights on the importance of metadata management and the role of catalogs in maintaining data consistency across various systems. The conversation also covers open-source projects like the Python Iceberg library and the future of databases in the cloud, the ease of use of internal tools, the integration of data for builders, the balance between simplicity and functionality in user interfaces, and more.

Notes:

Highlights from this week’s conversation include:

  • Kevin’s background and work at Stripe (0:31)
  • Evolution of Data Infrastructure at Stripe (2:18)
  • Kevin’s Interest in Data (5:29)
  • Software Engineer or Data Engineer? (8:27)
  • Speech Recognition Work at Amazon (11:06)
  • Efficiency and Cost Management (15:50)
  • Metadata and Query Analysis (18:38)
  • Surprising Discoveries in Metadata Analysis (21:43)
  • Optimizing Cost and Value (23:55)
  • Productizing Stripe Data (26:39)
  • Popular Tool for Data Interaction (30:08)
  • Enabling Data Infrastructure Integration (35:22)
  • Value of Data Pipelining for Stripe (39:32)
  • Next Generation Product and Technology (43:54)
  • Maximizing Value in a Decentralized Environment (51:34)
  • Future of Open Source Projects in Data Infrastructure (57:59)
  • Final Thoughts and Takeaways (59:02)

 

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

 

Transcription:

Eric Dodds 00:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com. We are here on The Data Stack Show with Kevin Liu. Kevin, thank you so much for giving us a little bit of your time today.

Kevin Liu 00:30
Yeah, thanks for having me.

Eric Dodds 00:32
All right. Well, you’ve done a couple of really interesting things in data. But just give us your brief background. How did you start? And what are you doing today?

Kevin Liu 00:40
Sure. I’m currently a software engineer at Stripe. I’ve been working there for around three years, working on data infrastructure, so a lot of open source technologies such as Trino and Iceberg. My team powers our internal BI analytics. And recently I’ve taken on another challenge on the data product side. The product is called Stripe Data Pipeline; we essentially enable merchants to get their Stripe data back into their warehouse, into their data ecosystem, in an efficient way.

Kostas Pardalis 01:22
This is great. So actually, I’ve known you for a while now; we’ve been talking since the time I was at Starburst, about Trino specifically. And I’m very excited today because I’ve had the opportunity to work with Stripe quite a few times, and it’s one of these companies that has been around long enough to go through many changes but always tries to stay at the forefront of what is happening out there. For example, a very early adopter of Spark, right? I’m pretty sure you probably still have pipelines in Scala there because of that. And you keep innovating; you’re open to using new technologies. Many things have happened in the past 10 years, let’s say. So having you from there, and you being there long enough to see these past three, four years of evolution, gives us a great opportunity to talk about what data infrastructures are today and what some interesting problems are, and also, based on your latest move, about turning data into products. I think that’s a very important next evolution stage when it comes to infrastructure around data. So that’s what I’m really excited about today. What about you? What are a few things that you’d love to talk about?

Kevin Liu 02:59
Yeah, I think, in general, I’ve been really happy working at Stripe, partly because of the engineering culture there. It really helped me learn and understand a lot of what is going on, especially in the data world: kind of, what is the newest and shiniest thing that we can work with, right? You know, I took a database class in college, didn’t think much of it, came to Stripe, started working with OLAP systems, Trino, Iceberg, and it was very new to me. But then eventually I started to realize that it was new to the industry as well. And that’s been really exciting to me: okay, how do I take this new concept? How do I run it efficiently at Stripe? And then how do I help the community? Because it is an open source project, how do I take the ideas that we come up with and share them with the community as well? And then on the data product side, I think Stripe is positioned very well to do data sharing. Not a lot of companies can do that, because not a lot of companies have data whose value can be shared with their customers in a way the customers are asking for on a daily basis. So, you know, I’m still learning. I just want to share some ideas with you guys, and I’m happy to talk more about things.

Kostas Pardalis 04:51
Yeah, let’s do it. What do you think, Eric? Are we ready?

Eric Dodds 04:56
I was born ready, Kostas. I was born ready.

Kostas Pardalis 04:58
I know. Let’s do it, let’s do it.

Eric Dodds 05:02
Kevin, so excited to have you on the show. And we have some really exciting subjects to talk about. You gave a brief introduction, but I’m interested to know, sort of going back to the beginning, you know, what sort of sparked your interest in the data side of things. So you have a background as a software engineer, but what drew you into the data aspect of software engineering?

Kevin Liu 05:29
Yeah, I always like to dabble around different domains, and data has been one of those things that just stood out to me. In terms of my work, I work on data infrastructure at Stripe, making it so that thousands of Stripes have access to data, and it’s been really interesting to see how that evolved from your traditional data warehouse. The open source aspect of it also drove me to participate more, to join communities, to learn from each other and to share what I’ve learned; that’s been really motivating for me, working in this field. And obviously, in recent months and years, there have been a lot of new developments. Watching the history of databases, people are calling this data lakehouse thing a new wave, and in a way I do believe it is a different paradigm from before. I see it firsthand: it enables a lot of interesting kinds of features and kinds of value. Fun fact: we used to run on Redshift, until we couldn’t run on Redshift anymore. We migrated to Trino and to Iceberg, with open source technologies, and we see firsthand how much value it provides to the company and to the folks who use it on a daily basis. I think it’s magical, right, that we’re able to analyze petabytes of data super, super fast. And at Stripe especially, we have a way for people to interact with data very easily: an internal tool that you can just go to and write some SQL. So that approach of democratizing data for folks at the company has been very well accepted at Stripe.

Eric Dodds 08:03
Yeah. I have a question. So your title is software engineer, but you work with a ton of data stuff. Just out of curiosity, do you consider yourself more of a software person or a data person? I know that title can be a little bit abstract, because it can mean so many things, right? And in some ways, building a data platform is what you’ve been doing. But I’m just interested in your perspective on that.

Kevin Liu 08:27
Yeah, sometimes I think about that myself. When I first heard the term data engineer, I wasn’t sure where I fit. My day to day goes from SQL to front end to BI to distributed systems; for every part of the data infrastructure, we have some kind of lever that we can pull. In a way, a lot of what I do is considered data engineering, but especially on the data infrastructure side, there’s a lot of software that exposes a good interface where sometimes you really need to dig into the internals of it. And this is where open source and having the community is great, because a lot of the time we’re able to talk with folks at other companies who also run infrastructure, and share what we learn with each other and with the community. I went to a Trino Fest event, I think it was 2021 or 2022, learned a lot, and came back to my team like, hey, Lyft runs their Trino clusters very efficiently; what can we learn from them? A lot of those things I really enjoy, and I guess that’s what software engineers do. I don’t know; we don’t have a data engineer role at Stripe, so I’m not really sure. I think I do a little bit of both.

Eric Dodds 10:08
Yeah, I mean, that’s actually part of the reason I asked the question: like you said, there are all sorts of interesting new developments in data technology and operating platforms. So it is really interesting to think about the confluence of multiple different skill sets that are really useful when running large data systems. Okay, I have a ton of questions about Stripe, but I want to jump back just a little bit. You worked on some speech recognition stuff at Amazon previously, and I just have to ask about that, especially after you talk about being sort of a data person and a software person: did those two things come together in that work as well? Because you’re dealing with massive amounts of data, and then trying to build a system that can essentially operationalize it.

Kevin Liu 11:06
Yeah, I think, in a way, yes. I forget who I was talking to, a software engineer with a lot of years of experience in the industry, but they basically told me that writing software is essentially just moving data around. So I think my role in this data engineering, big data world is being a software engineer and specializing in that. I worked on a speech recognition system for Alexa, where we were supporting the data science team. A lot of the job was: how do we provide the right abstractions for the data scientists and ML engineers to run their speech recognition models? How do we give them the right environment to do their work in a way that produces value? And it’s the same thing at Stripe: a lot of our work enables folks from other parts of the company to do their jobs and get whatever data they need, whatever insight they need, in a fast and efficient way.

Eric Dodds 12:31
Yeah, absolutely. Well, let’s dig into the world of Stripe. Can you give us a little more detail on what you’ve done at Stripe? What are the big projects that you’ve worked on and built?

Kevin Liu 12:45
Yeah, we did a bunch of stuff at Stripe in the years that I’ve been there. I was talking to a coworker before, and we were kind of reminiscing about the projects we took on; some of them feel like a decade ago. When I first started at Stripe, the whole company was in this big project to support India. And it was really interesting to me, because India has this concept of data locality, where the law says that Indian merchant data should not leave the country, right?

Eric Dodds 13:32
Yeah, it stays within the borders. Yeah. So I’m familiar with this.

Kevin Liu 13:36
Which breaks the concepts of software engineering, where everything is uniform, right? Because now all your data is physically in some place, instead of data just being blobs in S3. So that’s the first project I worked on, and it required kind of a foundational shift at Stripe: apply this concept all the way down the stack and make sure we’re supporting it everywhere. That was really interesting for me to see, how Stripe scales to support this concept that’s outside of what software engineering had taught me. And then a lot of what my team supported was our internal data analytics and BI product. We have a very popular internal tool called Hubble, which is essentially just a text box for SQL and a button that you can press to run the SQL, and you get some results back. Very simple interface, very well received; I think the daily active user count was in the thousands. Apparently you could walk around the Seattle office and folks would all have it up. We worked a lot on the front end, and on the back end, which is powered by Trino, and the various components: we had Hive tables, we had Iceberg tables. So my role was really a little bit of everything. And recently, you know, last year, the year of efficiency, what we worked on and focused on was tracking our spend and seeing what exactly we are paying for. We did a lot of work around metadata, and especially attributing what is going on in our infrastructure. So for example, whenever someone presses Run, we want to be able to say: okay, this query was run, and hopefully we have a reason for it. To compound the issue, we also expose an API endpoint, and a lot of orchestration is done in this SQL format: there can be cron jobs, there can be event handlers that say, when this happens, find some data in our data infra and then perform something else. So a lot of that “let me get data, let me find data, let me work with data” works off of this endpoint. And this is where it’s very easy to have runaway costs: once you expose the internal endpoint, everyone at Stripe wants to integrate with it, because it’s very easy to set up. For us on the data infra side, we very quickly needed to figure out what is actually happening and what we are spending money on. Because over the years, we just assumed it was natural growth, right? Every couple of months we’d say: okay, Stripe is growing by this much, the business is growing by this much, so the compute naturally grows with it; let’s scale up our clusters, let’s add new machines. We know that over the years we valued growth over efficiency, but when it was time for efficiency, we really had to hunker down and figure out what exactly we were spending on.

Eric Dodds 17:53
I want to ask you about this. So you have metadata on a query being run; how did you tie that back? How did you discover why? Because I would think that’s sort of the big question. What comes to my mind is that a lot of analytics projects can be ad hoc, right? You need to run a bunch of queries on a bunch of data to answer a question, but once you answer it, you have the insight you need and you move on. It’s not like that’s a persistent report or whatever. So how did you figure out why, or whether something was ad hoc or ongoing?

Kevin Liu 18:37
Yeah, the first thing we wanted to figure out was the big picture of what is happening. We know there are certain kinds of data operations going on: we know there’s ad hoc analytics, we know there’s BI and reporting, and there’s operational stuff, like “tell me when something happens.” There are a lot of these use cases, and an ever-growing number of them. From the infrastructure side we treated them all as kind of the same, even though they weren’t: ad hoc analytics requires a different latency spec than a service, right? If it’s a cron job, it just wants to run in the next 30 minutes, whenever; if it’s ad hoc, someone’s waiting. But on the data infra side, we wanted to see exactly what was going on across all of those realms. So the first step was actually just to collect that data. Do we know how many people are running ad hoc queries? Do we know how much of our compute is spent on dashboarding, on service queries, on this and that? This is where the metadata comes in, and depending on how you structure the metadata, you can really slice and dice your way into the different kinds of usage. For us, the first thing was: we know specific services have specific queries. On this website we have internally, most people go there for ad hoc stuff. And this cron service we have; a lot of these services also build out their own services, so this cron service actually has different teams under it. So how do we ask the cron service to give us more information so we can slice and dice that too? You really get into a realm where you tell the cron service: every time you send a query to us, give us as much information as you can about it. And this is nice, because we own all the infra, right? The codebase is all Stripe’s. We can go to that team and say, hey, I want to add extra metadata every time you send us a query, and they’re like, okay, cool; it’s not that big of a deal. But for us, now we see that this query is from this cron service, which is from this team, which is from this task that runs every so often. You really get into the analysis part of it with just, you know, three fields in your metadata.
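
(For readers who want to see what this attribution pattern can look like in practice: Trino lets a client attach free-form tags to every query, and those tags surface in the engine’s query events. Here is a minimal sketch using the trino Python client; the tag names, host, and catalog are hypothetical placeholders, not Stripe’s actual setup.)

```python
# A minimal sketch of query attribution via Trino client tags.
# Host, catalog, and tag names are placeholders: the idea is that every
# caller (cron service, dashboard, ad hoc UI) stamps its queries with
# enough metadata to slice spend by team / service / task later.
import trino

conn = trino.dbapi.connect(
    host="trino.internal.example.com",
    port=443,
    http_scheme="https",
    user="report-cron",
    catalog="iceberg",
    schema="analytics",
    # These tags travel with the query and show up in Trino's query events,
    # which is where the "three fields of metadata" can be joined back
    # against team and service ownership.
    client_tags=["team:payments", "service:report-cron", "task:daily-rollup"],
)

cur = conn.cursor()
cur.execute("SELECT count(*) FROM events WHERE ds = DATE '2024-03-01'")
print(cur.fetchall())
```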

Eric Dodds 21:34
Yeah. Well, I have to know: what’s one of the most surprising things that you and your team discovered when you started slicing and dicing the metadata?

Kevin Liu 21:43
Yeah, so we always knew there were some inefficiencies in our system, and at a hypergrowth company, it happens. Sometimes the best thing you can do is focus on the most impactful things, and sometimes that’s not cleaning up stuff. Once we started gathering data, the most egregious thing we found was a cron service that runs every hour, and what it does is just run SELECT MAX(updated_at) on a table. Pretty simple, right? I just want to know when this table was last updated. But then you dig into the details. Maybe when this query was first set up two years ago, the table was a couple of megabytes or a couple of gigabytes. Now the table is about a petabyte of data, and it’s not structured or partitioned correctly, so that MAX(updated_at) is doing a full table scan of petabytes of data. And you’re doing this in a distributed Trino environment where you can have 10, 100 machines running, and it takes around two or three CPU-days to run one of these queries. And then you see that this query runs every hour on a cron job. You multiply all those factors, and we were spending so much computation on this one simple query. Then you go back and say: okay, who owns this? What is it for? Can we tell them that Trino with Iceberg has this concept of a metadata table, where you can look at the metadata instead of doing a full table scan; okay, this is how we’re going to optimize it. And we find the team is no longer around, and they don’t need this. So this whole process was doing that much computation for zero value. There was a lot we found that was very surprising. And for us, it’s great; it’s all savings, right? We can take a lot of these and, every so often, write a report, do some analysis, and stop this from happening. But it was just really surprising from our side to find something like that.
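
(To make that before/after concrete: Trino’s Iceberg connector exposes per-table metadata tables, so the “when was this table last updated?” question can be answered from snapshot metadata instead of a full scan. A sketch with a hypothetical events table; the connection details are placeholders.)

```python
# Sketch: the same "when was this table last updated?" question, two ways.
import trino

conn = trino.dbapi.connect(
    host="trino.internal.example.com", port=443,
    http_scheme="https", user="analyst",
    catalog="iceberg", schema="analytics",
)
cur = conn.cursor()

# Before: SELECT max(updated_at) FROM events
# scans every data file of a now petabyte-scale table.

# After: read a few kilobytes of Iceberg snapshot metadata instead.
# committed_at is the commit timestamp of each table snapshot.
cur.execute('SELECT max(committed_at) FROM "events$snapshots"')
print(cur.fetchone())
```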

Eric Dodds 24:18
Yeah, I can see both sides, right. On the one hand, it is surprising to see that, where you’re like, okay, maybe this one gets the award for most expensive query in the history of the company. But at the same time, Stripe is a huge company and growing fast. It was probably a significant need at some point, and things change. And I would guess that with something like that, if you don’t have the context, it’s scary to go back in and touch it, because it may be running some really important piece of the business. But man, I can’t imagine that thing.

Kevin Liu 25:01
On the infra side, right, a lot of the problem we have is the disconnect from what these things are used for, which really pushed us to go specifically to the domain and ask: hey, I see this happening in our system. What is it? Can we help optimize it? Because the domain experts might not know exactly how to write this query to get the same result in a better way; on the infra side, we know how to give you that. We can rewrite it against a metadata table, and now you’re reading a few megabytes of metadata instead of doing a petabyte scan. That disconnect is where this really helps. Obviously, it would be great if we could automate all of this so no one has to think about it, but we have to push all the way up to the domain and kind of figure it out from there together.

Eric Dodds 26:08
Yeah, I mean, my opinion, and I’m interested to see if you agree with this, is that it’s not necessarily the responsibility of that end user to understand how to optimize that, right? They’re trying to pull data so that they can do their job. Super interesting. Well, let’s change gears just a little bit here. One of the latest projects that you’ve been working on is actually productizing Stripe data, which sounds absolutely fascinating, and I know Kostas has a million questions about it. But can you just describe that concept? What was the need, and what’s the project like?

Kevin Liu 26:50
Yeah, so this is how I’ve been internalizing it. Stripe is an API-first payments company; at least when it first started, that was the flag we planted. We have a set of APIs you can interact with, and you can work off of the global payments rails. That evolved into: I have a set of reporting APIs. As a merchant, I do a bunch of stuff with Stripe; Stripe helps me facilitate a lot of payments. Now I want information back: how many payments have gone through, how much money has gone through Stripe. Either I keep a system of record on my side, where every time I write some information, I also keep some information, or Stripe builds out a suite of products to say: no, I am the source of truth, I’m the record keeper, here’s your information, and let me repackage it in a way that adds value for you, the merchant. And this evolved from APIs into something called Stripe Sigma, which is a way, on the Stripe website, to interact with your own Stripe data as a merchant. You can go to Stripe Sigma, write some SQL queries, press Run, and get some results back: how much have you processed, how much have you utilized Stripe for. But for a lot of enterprise cases, they don’t want to work off of stripe.com. They don’t want the SaaS product; they have their own data engineering team, their own data infrastructure ecosystem, and they want that data in their systems, so they can integrate it with, you know, maybe their own system of record, and add different features and different value to that data. So that’s where the problem statement is: as a merchant, and especially an enterprise merchant, I want Stripe’s data in my ecosystem. Can you give me that data? And there’s a lot of off-the-shelf software. Fivetran is kind of the market leader in this; I think they just scrape the Stripe API, write it down, and push it out. But on our side, we have all the data; we just need to push it out. We want to make it easy and seamless to integrate with different ecosystems. So that’s what we’re working on. And I think there’s a lot of interesting development in this area from different cloud vendors and different data vendors in this space, and I’m pretty excited to be working on this.

Eric Dodds 30:01
Gosh, I have 1,000 questions. Kostas, I’m going to hand the mic over to you, because I bet you do too.

Kostas Pardalis 30:08
Thank you, Eric. Before we go back to the data products case you just talked about, I want to go back to the tool you mentioned that became really popular inside Stripe. You said it was just a text box where you could write a SQL query and run it, right? And my question is: in a world with so many BI tools out there, so many hours spent on figuring out the most efficient way for someone to interact with data through a graphical user interface, why has this tool become so popular? What was the need it was fulfilling, and couldn’t that need be satisfied by all these BI tools out there?

Kevin Liu 31:00
Yeah, that’s a good question. Why this tool was made in the first place was kind of before my time, but one thing I do know is that I really enjoy using it, and so do a lot of people in the company. I’ve been trying to figure out why it’s so popular, why it’s so successful. One, it’s very simple: the interface is very simple, and it accomplishes what you want. You write some SQL, you get some data back, there’s simple filtering, and if you press Graph, you can turn it into a line graph, a pie chart, or whatever you want. The most-used features are easy: features with reasonable defaults. So it’s very powerful for me to just write a query, select date of whatever, aggregate whatever, get a result, press “turn this into a line graph,” and boom, that’s what you get. If you want to tweak it more, you can go in and write more visuals and whatnot, but for the majority of folks doing analytics, that’s enough. I know for me it’s very useful. And I think Trino, being the back end of it, really powers this kind of magical “wow, it’s so fast” experience. And it being federated as well, we’re able to connect a lot of other different data sources. What we were talking about with the attribution of different queries: we threw that into a database and connected it back, and now your data ecosystem is all connected. I can query on this interface how many queries were run in the last hour from the ad hoc stuff, or just from the service stuff. So it’s very central to our data ecosystem. And, you know, I was looking at Superset, trying to figure out: okay, could we migrate to something open source? I think the difference between Superset and what we use, at least when I prototyped with it on my own time, is those very simple defaults. There are two or three features that everyone uses and everyone loves, and with Superset it’s a little bit more difficult to set things up. That jump in difficulty really is the big factor when you’re working with tooling.
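
(A sketch of the federation point: because Trino mounts multiple catalogs, query history sitting in the lake can be joined against, say, service ownership in an operational database in a single statement. Every catalog, schema, and column name below is made up for illustration.)

```python
# Sketch of a federated Trino query: query events in the lake (Iceberg)
# joined with service ownership in Postgres. All names are hypothetical.
import trino

conn = trino.dbapi.connect(
    host="trino.internal.example.com", port=443,
    http_scheme="https", user="analyst",
)
cur = conn.cursor()
cur.execute("""
    SELECT s.owning_team,
           count(*)           AS queries_last_hour,
           sum(q.cpu_time_ms) AS cpu_ms
    FROM iceberg.audit.query_events AS q      -- lake: one row per query
    JOIN postgresql.meta.services   AS s      -- operational DB: ownership
      ON q.client_tag_service = s.service_name
    WHERE q.finished_at > current_timestamp - INTERVAL '1' HOUR
    GROUP BY s.owning_team
    ORDER BY cpu_ms DESC
""")
for row in cur.fetchall():
    print(row)
```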

Kostas Pardalis 33:53
Yeah, that makes total sense, and it’s super interesting. And then you also mentioned exposing endpoints to work with data, right? So you’re not just offering, let’s say, a way for people to go and visualize the data; you also want builders to go and build on top of the data, to integrate with the data infrastructure. So how do you do that? I’m assuming that the typical use case around BI and the Hubble-like concept is that you don’t have too many concurrent queries; it’s much more big queries that tend to take much longer to complete. That’s a very different set of trade-offs than, I don’t know, someone from the front-end engineering team deciding: oh, now I have this data, I’ll create this service that hits it every second, or sub-second, or whatever, right? So how do you balance that? Because we’re talking about opening up to every possible use case out there, and some of them might not be, let’s say, compatible with those assumptions.

Kevin Liu 35:23
Yeah, I think that’s exactly right. The API is both a blessing and a curse, I would say. It makes it very easy to integrate with all of the environments that we have, all the different languages, because HTTP is pretty universal, right? But on the flip side, a lot of our compute cost could be reduced: if you are in a Java environment and you’re working with Iceberg, you could just go use the native Iceberg library instead of round-tripping through compute that goes through Iceberg and then back again; you can really just go and read from the source. So that’s something we’ve been struggling with, but it’s just an optimization at the end of the day. The pro case for opening this up as an API is that integration is much easier, getting things done is much easier; getting data is much easier no matter what repo, what language, what environment you’re working in. I’m totally with you that a lot of the time it’s not the best way to do it, but for now, being able to build out these use cases without being blocked on “how do I get this data” has been very useful for Stripe in building out different features of different products.
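
(The same native-read idea exists in Python via the PyIceberg library Kevin mentions contributing to: a service can load a table from the catalog and scan the underlying files directly, instead of round-tripping through a query endpoint. The catalog config, table name, and columns below are placeholders.)

```python
# Sketch: reading an Iceberg table directly with PyIceberg rather than
# calling an internal HTTP query endpoint. All names are placeholders.
from pyiceberg.catalog import load_catalog

# Catalog type depends on the deployment: "glue", "hive", "rest", ...
catalog = load_catalog("prod", **{"type": "glue"})

table = catalog.load_table("analytics.events")

# The filter and projection prune partitions/files before reading,
# so only the matching Parquet files are fetched.
arrow_table = table.scan(
    row_filter="ds >= '2024-03-01'",
    selected_fields=("merchant_id", "amount", "ds"),
    limit=1_000,
).to_arrow()

print(arrow_table.num_rows)
```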

Kostas Pardalis 36:57
Yeah, 100%. I think it ties into the culture of the company, right? If you promote creativity over control of the resources, then that’s the trade-off you’re making there, and it makes total sense. And I think it’s a trade-off that always exists in engineering: when you start optimizing, usability usually goes down, unless you narrow down the use case a lot. So it’s this balance between how accessible I make my systems and how robust I make them, and it’s always very delicate. It’s very interesting to see how this plays out in a company like Stripe, right?

Kevin Liu 37:46
I wouldn’t say we over-index on it, but I think we value being able to unblock and facilitate product development and feature development, and have folks not be blocked on accessing data. That’s something I’ve been really fond of, working at Stripe.

Kostas Pardalis 38:10
Yeah, no, that’s amazing, actually, especially at the scale of a company like Stripe, right? Because these queries, at that scale, cost a lot of money. When you’re at a scale where, let’s say, 1% performance gains translate into probably millions of dollars, things are much more complicated. So promoting that needs to be part of the culture of the company, and that’s amazing, I think. All right, let’s go back to the pipelining stuff, because it’s also very interesting. So as you said, there have been vendors out there for quite a while now facilitating the extraction of data and the loading of data into other systems, like Fivetran.

Kostas Pardalis 39:03
Why did Stripe want to get into that business, in a way, right? What’s the value for someone like Stripe, whose core competence is not moving data around (it’s processing payments)? Why is this becoming so important today that Stripe is actually dedicating resources to building a robust solution for it?

Kevin Liu 39:34
Yeah, I can give you what I think is the answer. So, Stripe is pretty innovative in that a lot of the features that get developed, and a lot of the roadmap, are driven by the customers themselves. You probably go on Twitter and see a bunch of our people, product leads, co-founders, asking: hey, how do you want to see Stripe improved? What part of it do you want to see improved? We have Friday firesides where other company founders come in to talk about how they use Stripe, and the question is: what don’t you like about it? Where can we improve? And with that mentality, a lot of the data side has been a natural progression of what the customers want. So, Stripe Sigma is essentially a SaaS product on stripe.com where you can write SQL to interact with your own data. That was the first iteration, and it’s very similar to what we have internally: just a website, a SQL dialog, and a Run button, and it returns the data. That came out of customers wanting to interact with their data. And for SMBs, people without their own data infrastructure, that’s pretty good; you can go do a bunch of SQL analysis just through Stripe. Then for enterprises, they don’t want to use that. Maybe it’s their data size, or regulation, or privacy; for some reason they don’t want to use that product, but they still want to interact with this data. So there has been a need to provide this data to our customers. And the need is pretty validated, right? You have other companies that these merchants go to and say: hey, I want my Stripe data. Can you give it to me? I don’t care how, just give it to me; I’ll pay you for it. So then the natural progression is: well, why go through the extra step? And a lot of the time, the way these companies get the data is also pretty costly: they call the APIs, write it down, send it to the merchant. So the natural progression is: okay, how do we do this in a way where our customers benefit, and we can also turn it into a product? That’s kind of been the line of thinking. And I think the way it started was that a customer, a pretty big customer, asked for this. They’re like: hey, I don’t want to work off of your website, I have my own data engineering team, I have my own data engineering ecosystem; just give me the data and let me do what I want with it. And then more and more companies came in asking for this. The way we see it, there’s a segmentation: SMBs can use Sigma, and enterprises can use Stripe Data Pipeline.

Kostas Pardalis 43:02
Makes sense. So what’s the difference between someone using, let’s say, a third-party vendor that continuously hits the Stripe API to export data and load it into, say, the customer’s S3 bucket, versus what Stripe does with Data Pipeline? Let’s talk briefly about the product experience, if you can talk about that, but also, most importantly, about the technology. What’s the difference there? In one case we have HTTP, which, as you said before, is pretty inefficient but pretty universal at the same time. But maybe there’s a better way to do it. So what are the technical choices that you as an engineer make to enable a different product experience at the end?

Kevin Liu 43:53
Right, yeah. And this is where I really believe in the next generation of this product. If you go to Stripe Data Pipeline right now, we’ve gone with Redshift and Snowflake: as a merchant, you can sign up for this product and get your Stripe data in your Redshift cluster or your Snowflake cluster. And we do this in a way where we get our data from our source of truth. The reliability factor, the data consistency and data correctness factor: we take that on and we guarantee it, in a way where, whatever happened upstream, we can just say: here, we calculated from the source of truth; let me push the data out to you. That’s very difficult when you have a man in the middle, like a third-party vendor. I’m sure there’s a way to solve it, but at the end of the day, going from the source is a lot cleaner and a lot easier for both Stripe and the merchant. And API calls are expensive, right? If you’re scraping a website, the API calls get super expensive, and when you’re scraping Stripe, there’s a cost to Stripe as well. So internally, migrating all of those API calls onto this product is just a win-win. In terms of technology, something that I’m really interested in is the idea of data sharing. An API call, an SFTP server: a lot of these things are very old, proven methods from the 80s and 90s. And with a lot of the developments in the data space, especially with a lot of cloud vendors and data vendors innovating on a bunch of different data-sharing technologies, I think Stripe is in a good position to piggyback off that, so we can offer our merchants integration with all of these ecosystems. So something that has been going on in the industry is the rise of Apache Iceberg, right? Something I saw recently, I think last year, with Salesforce and, I think, Snowflake: there’s a blog post saying they’re integrating Salesforce data with Snowflake. One-click, or zero-click, zero-ETL, whatever, right? You can get your Salesforce data in Snowflake super fast, super easily. We see the same thing for Stripe. We want to give you your data on Snowflake, Databricks, AWS, Azure, anywhere your data infrastructure is set up. And I think the rise of the lakehouse architecture, where compute is separated from storage, really helps our case, because right now we publish to specific warehouses: it has to be Redshift, it has to be Snowflake. With this lakehouse architecture, we want to publish to storage, you bring your compute, and the integration should happen seamlessly. We can use Iceberg, we can use different technologies to facilitate this, but the core concept, that we’ll give you the storage and you bring your compute, is very exciting to me as the next iteration of this product.

Kostas Pardalis 48:16
So, just to understand the use case here with Iceberg: the way you see it, is it that the data lives, let’s say, on Stripe, but the user has the capability to choose where to expose this data through Iceberg, so an external query engine can just go and query it? Or is it more like: this is your data, we’re going to export it into your own S3 bucket, because that’s your storage and you want to have it there, and we’re going to do that using Iceberg so that it’s then easy to expose it to different query engines and all that? Which of the two approaches is usually more favorable for the users out there?

Kevin Liu 49:13
Yeah. I think there are multiple levels of abstraction, right? At the core, we’re exposing some data, and the merchants want to be able to interact with that data. We can throw it onto an SFTP server as CSV, or we can throw it onto Azure or AWS S3 as Parquet files. And then it’s about meeting the merchant where they are, connecting their ecosystem and our ecosystem. So Iceberg is one of the abstractions: we can throw our files on S3 and create some kind of catalog to represent them as Iceberg. And the reason Iceberg is so popular, and so interesting for us, is that all of these vendors, all of these compute systems, are now integrated with Iceberg. So this is a step removed from us, an extra step that we don’t have to do: if we just deliver something in Iceberg, you can read it in Snowflake, you can read it in Databricks, you can read it with Athena, and so on. It’s about us taking the data and making these levels of abstraction so that our merchants can integrate it in a better way. If our merchants want Delta tables, we have the underlying files; we just need to generate some metadata, and now you have Delta tables. So for us, it’s about thinking through where we want to be, where our users and their ecosystems are, and meeting that demand on our side, enabling them to get the data.
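
(This “we have the files, we just need to generate metadata” step is roughly what recent PyIceberg releases expose as add_files: committing Parquet files that already sit in object storage into an Iceberg table without rewriting them. A hedged sketch; the bucket, namespace, and schema are invented, and it assumes the files already match the table schema.)

```python
# Sketch: registering existing Parquet files as an Iceberg table by
# writing metadata only, so any Iceberg-aware engine can then query it.
# All names are placeholders; assumes PyIceberg >= 0.6 and matching schemas.
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import LongType, NestedField, StringType

catalog = load_catalog("prod", **{"type": "glue"})

schema = Schema(
    NestedField(field_id=1, name="merchant_id", field_type=StringType(), required=True),
    NestedField(field_id=2, name="amount", field_type=LongType(), required=False),
)

table = catalog.create_table("exports.charges", schema=schema)

# Commit the files as-is; no data is rewritten, only metadata is added.
table.add_files([
    "s3://merchant-bucket/exports/charges/part-000.parquet",
    "s3://merchant-bucket/exports/charges/part-001.parquet",
])
```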

Kostas Pardalis 51:07
Yeah, it makes total sense. And one question here, because I think the value of decentralizing the data in these ways is obvious, right? Both from an engineering perspective, in terms of efficiency, but also from a business perspective: not having to use one vendor’s set of tools, and pay for all of that, without getting the best possible experience at the end and maximizing your value. My question, though, is: in this highly decentralized environment, with all these different options, how can people keep track of what is available to them? How can they find the data they need, and know that it’s the right data? Yes, of course, you can create some metadata, create Iceberg tables, and have a catalog that a system can go and access, and it can be a Hive Metastore, right? But then if you go to something like Snowflake, you probably need a different catalog to be populated there for that to happen. So we get to these meta-problems, in a way: how do we keep all this metadata consistent and available, so that people can figure out what they can use and how to work with it? So, first of all, do you think this is a problem, or is it just in my mind? And if it is, what are the possible solutions out there?

Kevin Liu 52:48
Yeah, no, I think it’s definitely a problem. Well, not a problem; it’s just the way that it’s set up. Iceberg, or any table format, is essentially your data with some metadata. You have to keep your metadata somewhere, and for Iceberg, that’s the catalog. The catalog does the translation of: here’s my table, and here’s everything I know about this table and where it is. You have the Hive Metastore, you have Glue, you have a REST catalog. I think this concept of a catalog is super interesting when you’re talking about these table formats. It’s essentially the abstraction a lot of these vendors are using to not lock you into their ecosystem. But it’s one of those things that’s difficult to work with when you’re across many ecosystems, right? You can have an Iceberg table in Snowflake, but if it’s managed by Snowflake, it’s in their own catalog. And maybe you’re an enterprise with multiple different ecosystems: you want to use Snowflake, and Databricks, and something else like Athena. Where your catalog is determines which systems you can use. So if you have an Iceberg table that’s only in the Snowflake catalog, it’s really difficult to use it in Databricks. If you have Unity Catalog, which also works with Iceberg, it’s hard to export that and put it into Snowflake. Right now you have integrations between these catalogs, and this is where Iceberg is kind of innovative with the REST catalog. I think it’s very interesting that there’s a REST protocol that represents a catalog: it can plug and play whatever backend you have. It’s a level of abstraction that kind of does away with the details and the vendors and everything. What it means for us, we’re still trying to flesh out. If we want to integrate with table formats, where are we going to store the catalog? Do we need to store multiple copies: one in Glue for AWS, one in Unity for Databricks? Now you’re kind of locked in at the catalog level; how do we get out of that? I think those are interesting questions. A lot of integration is happening; Glue can be read in other places. But what these vendors do, a lot of it is: we make it easy for you to read in from other catalogs, but we make it hard for you to read out anything that we have. So it’s an interesting time period that we’re in.
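
(The REST catalog idea, seen from the client side: the catalog is just an HTTP endpoint speaking a standard protocol, so any engine or library that implements it can resolve the same table. A sketch with a made-up endpoint and placeholder credentials, again via PyIceberg.)

```python
# Sketch: resolving a table through an Iceberg REST catalog. The endpoint
# and credential are placeholders; the point is that the catalog is a
# protocol rather than a vendor, so Trino, Spark, PyIceberg, and others
# can all resolve the same table through the same URI.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "shared",
    **{
        "type": "rest",
        "uri": "https://catalog.example.com",
        "credential": "client-id:client-secret",  # hypothetical OAuth client
    },
)

table = catalog.load_table("exports.charges")
print(table.current_snapshot())
```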

Kostas Pardalis 56:18
That makes total sense. Okay, I think we should do another episode just talking about catalogs, to be honest. But we are close to the end here, and I would like to let Eric ask any other questions he might have. So Eric, it’s all yours again.

Eric Dodds 56:32
Yeah. Kevin, it’s been so interesting to hear you talk about a lot of the practical ways that you’re solving problems day to day with your infrastructure. But you are a very curious guy, and so I’m dying to know: when you look at the data landscape in general, what are the most interesting new projects that are exciting to you, maybe even in open source? Especially when you sort of remove yourself from the limitations of the infrastructure you work in every day.

Kevin Liu 57:05
Yeah, I think Iceberg has definitely been on my list. I’ve been participating in the Python Iceberg library, just contributing there. And I think a lot about this disaggregation of different database components, OLAP components. I think of our current infrastructure as a database turned inside out, into different services: Trino is essentially the compute, S3 and Iceberg are the storage, and now people are building indexes and all these features on the side. So a lot of what interests me is things like Apache Arrow, where you can integrate these systems together, and DataFusion, where you can have components of a traditional database and work with them in a way where your planning, your compute layer, and your storage layer are in different libraries, and you can mix and match. A lot of these foundational core pieces of the database are now being pulled out into these open source projects, so I’m very interested in seeing the development of those, and there’s a lot of active involvement in those fields. And we’ll see: maybe in a year or two, we’ll go back to what a traditional database looks like, but just in the cloud with all of the bells and whistles.
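
(A tiny illustration of that mix-and-match idea, using the datafusion Python bindings: Parquet as the storage layer, Arrow as the in-memory format, and DataFusion as the planner and execution engine, all composed as libraries. The file and column names are invented.)

```python
# Sketch: a database "turned inside out" -- storage (Parquet), memory
# format (Arrow), and query engine (DataFusion) composed as libraries.
import pyarrow as pa
from datafusion import SessionContext

ctx = SessionContext()

# Register a Parquet file as a table; DataFusion plans and runs the SQL.
ctx.register_parquet("events", "events.parquet")

df = ctx.sql("""
    SELECT merchant_id, count(*) AS n
    FROM events
    GROUP BY merchant_id
    ORDER BY n DESC
    LIMIT 5
""")

# Results are Arrow record batches, so they can be handed to any other
# Arrow-aware tool (pandas, Polars, DuckDB, ...) without copying.
print(pa.Table.from_batches(df.collect()))
```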

Eric Dodds 58:50
Well, Kevin, this has been such a great conversation. Thanks again for joining us for the show today.

Kevin Liu 58:55
Yeah. Thanks for having me.

Eric Dodds 58:57
We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe to your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.