On this week’s episode of The Data Stack Show, Kostas and Eric are joined by the risk data engineering manager at Intuit, Alex Lancaster. Alex has been with Intuit, known for products like QuickBooks, TurboTax, and Mint, for almost 15 years and was part of a recent, large-scale, and successful re-architecture from on prem to the cloud.
The Data Stack Show is a weekly podcast powered by RudderStack. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Eric Dodds 00:06
Welcome to The Data Stack Show. We have Alex from Intuit on the show as a guest today. And my burning question that I want to ask Alex, is, he’s been at Intuit for a really long time, you know, and it’s really common, I think, among our guests, you know, they’ll have different roles in different companies, which is really cool. It’s just unique to see someone who’s been at a company for well over a decade. And so one of the main questions I want to ask Alex is what he’s seen in that time within an organization that just gives you a really unique perspective. Kostas what’s the main question you want to ask Alex?
Kostas Pardalis 00:44
I really, really want to ask him about the migration from on prem to the cloud, especially for a company of the complexity and the size of Intuit. So I’m very, very excited to talk with him today and learn more about this.
Eric Dodds 01:02
Great. Well, let’s go and ask our questions.
Kostas Pardalis 01:05
Let’s do it.
Eric Dodds 01:06
Here from Intuit, Alex, thank you so much for joining us on the show today.
Alex Lancaster 01:13
Sure. Thank you for having me.
Eric Dodds 01:14
Now, I’m really excited to chat with you. Because I think you’re gonna bring, I think, a unique perspective, you’ve spent well over a decade coming up on 15 years at the same company working in software and data. A lot of the guests we have, you know, have been at multiple different companies over that period of time. And so I’m just really excited to hear about your perspective, having been at the same place over such a period of change to technology. So why don’t you start out by giving us just a little bit of background on yourself and talk about what you do at Intuit.
Alex Lancaster 01:51
Okay, so my name is Alex Lancaster. I’m the risk data engineering manager at Intuit. In February, it’ll be 15 years there for me. And before that, I worked at United Title Escrow as a software engineer for four years. And before that, I worked for an MLS company in Simi Valley for almost six years. And my current work for Intuit is mostly in the data and engineering, data warehousing, data pipeline space for risk and fraud management for money movement. And we also do some stuff for the compliance folks and pricing and accounting and finance teams. And we help design all kinds of data marts, data warehouses, reporting dashboards, things like that. And then our product internally is known as the risk data mart.
Eric Dodds 02:40
And could you explain just for the sake of our listeners, could you explain the concept of a data mart within Intuit and sort of how it seems like a product that your team is producing for other people within the company? Could you dig into a little bit about what a data mart is?
Alex Lancaster 02:57
Sure. So it’s, you know, it’s usually a large collection of tables that have been brought in from different sources, many different sources, we probably have, you know, 20 or 30 different sources that we bring in data to. Usually these are, you know, front-end source systems, and they’ll do one little piece of the pie or piece of the business. And then when it comes time to understanding the big picture, and people want to do reporting for long periods of time, and they want to aggregate data and roll up data across lots of different functions, you know, you’ve got to have that all in one place. So that’s mostly what a data mart is about. And also, you know, the data is often transformed or pivoted or flattened, whatever you want to call it into schemas. And, you know, this is where the Kimball conformed dimensional schema came from many years ago, and you know, so people, you know, you want to transform the data in a way that makes it work really well for reporting and analytics. And because the source systems that are upstream, they’re usually designed to be fast at a transactional level, you know, so, so you can select insert, update, delete, be really fast for one record, but in a data warehouse, you’re running queries across, you know, millions or billions of records and long periods of time. And that’s a totally different kind of workload than the upstream source systems do. So that’s sort of a summary of what a data mart is.
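Editor's sketch (not part of the conversation): the contrast Alex draws between a single-record transactional lookup and a roll-up across a dimensional schema can be illustrated with a tiny in-memory example. The table names, columns, and figures below are hypothetical, and SQLite stands in for both the source system and the warehouse purely so the snippet runs anywhere.

```python
import sqlite3

# Hypothetical two-table star schema: one dimension, one fact table.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_merchant (
        merchant_key INTEGER PRIMARY KEY,
        segment      TEXT
    );
    CREATE TABLE fact_payment (
        payment_id    INTEGER PRIMARY KEY,
        merchant_key  INTEGER REFERENCES dim_merchant(merchant_key),
        payment_month TEXT,
        amount_cents  INTEGER
    );
    """
)
conn.executemany(
    "INSERT INTO dim_merchant VALUES (?, ?)",
    [(1, "retail"), (2, "services")],
)
conn.executemany(
    "INSERT INTO fact_payment VALUES (?, ?, ?, ?)",
    [
        (100, 1, "2020-01", 2500),
        (101, 1, "2020-01", 4000),
        (102, 2, "2020-01", 1500),
        (103, 2, "2020-02", 3000),
    ],
)

# Transactional-style access: fetch one record by its key.
single_payment = conn.execute(
    "SELECT amount_cents FROM fact_payment WHERE payment_id = ?", (101,)
).fetchone()

# Analytic-style access: scan the fact table and roll up by dimension and month.
monthly_rollup = conn.execute(
    """
    SELECT d.segment, f.payment_month, SUM(f.amount_cents)
    FROM fact_payment AS f
    JOIN dim_merchant AS d ON d.merchant_key = f.merchant_key
    GROUP BY d.segment, f.payment_month
    ORDER BY d.segment, f.payment_month
    """
).fetchall()
```

The first query touches one row; the second has to read every fact row, which is the workload a data mart is shaped for.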
Eric Dodds 04:29
Yeah, I mean, it’s, I mean, I, I have a ton of questions. And I have one more before I hand it over to Kostas. Because I know he’s probably chomping at the bit with all the interesting stuff there. But one observation is that, you know, we’ve talked to a lot of different people from a lot of different companies working in, in data engineering, and it’s really cool to hear about, I guess, I would call it the productization of delivering data to the company. I mean, even with a name like data mart, you know, in much smaller organizations, you know, you basically have the software engineering team also doing the data engineering team. And then you get a little bit bigger and you have, you know, maybe a data engineering team and perhaps a data analyst, and then those teams grow. But as sort of an individual delivering those things. But it’s really neat to hear about how you’ve really productized that in a pretty significant and widespread way at Intuit.
Alex Lancaster 05:23
Yeah, yeah, I think it has a lot to do with the size, right? So when you’re, you start out small, and you’re just supporting a few people or teams, you can approach it one way. But no, I have 11 engineers on my team now. And then we’re supporting 400-plus people across the enterprise. So when it gets big like that, then the story changes, and the way that you approach things has to change. And also, when you’re talking about money movement, and compliance, and SOX audits and things like that, you have to get more serious about how things are architected, and that sort of thing.
Eric Dodds 06:00
Sure. Okay. Well, I’m going to ask my burning question, based on your time at Intuit, and then I’ll hand it over to Kostas because I’m monopolizing the conversation. So almost 15 years at Intuit, congratulations. I think you’ve seen sort of what we would call the data engineering revolution firsthand, with just massive change in technologies, data infrastructure, the coming of age of the cloud, just all sorts of different, I mean, major, major milestones, in terms of the way that software is delivered and consumed today. So I’d love to know, when you look back over 15 years at Intuit, what are some of the big revolutionary changes that you’ve seen in the data engineering space?
Alex Lancaster 06:46
I think, for me, at least, the biggest is the move from the on prem world, into the cloud world. So when you have an on prem data center, you know, you’re, maybe you’re using, you know, storage area networks, and, you know, you have to worry about your own infrastructure and worry about storage space, and how many nodes do I have? And how much space left do I have on my SAN, and it may be if you have active passive or active data centers situation, you have to worry about replicating your data across to the other data center. So this world, while it was okay, and, you know, some companies were better at it than others, it had a lot of problems and drawbacks, you know, there was always people messing around with the network infrastructure, and, you know, doing patching or updates at weird times, and they may or may not tell you about it. So as good as companies could get at that, I don’t think they’re anywhere near what the big cloud companies are today. So when you move to a public cloud, and you’re in an Azure or AWS situation, those guys are investing billions every year into their cloud architecture, and infrastructure. And, you know, I’m pretty sure no company, even governments can’t compete with that kind of investment. And so they’re, they’re really good at it. And they have designed their cloud environment, you know, from the ground up, to be very scalable across the world. And you know, it just, and it lets you get out of the business of worrying about your hardware and your storage. And, you know, do I have hard drives that are popping network cards that are popping that kind of thing, you don’t have to worry about that anymore. So you can scale in a way that’s just impossible to do on prem. So to me, that’s, that’s the biggest kind of change I’ve seen, you know, in the last 10 years or something like that. So I can talk a little bit about, you know, what we were doing on prem versus what we’re doing in the cloud today. Just a quick summary.
Eric Dodds 08:58
Sure, yeah, that that’d be great. And I do think, I mean, I’ve never heard the comparison of, you know, not even a government can invest that much into the technology. And I think that’s just a fascinating comparison. So thanks for that. That got my mind going. But yeah, we’d love to hear about your migration from on prem.
Alex Lancaster 09:17
Sure. So when we were on prem, you know, we actually had a pretty nice setup. We were using SQL Server Enterprise Edition. We had a nice, you know, Dell Fibre Channel SAN dedicated to our environment, it was 185 terabytes, which is pretty good size. And we had two data centers. So we had an identical setup about 1,000 miles away in a second data center, with data replication running between the two. And you know, that worked well for a while. And 185 terabytes is, you know, nothing to sneeze at. It’s mostly row store data, though. SQL Server does have columnstore indexes, so there were some, you know, columnar tables, which we’ll get into later, but mostly row store stuff. And then in September 2017, you know, we started to work on the AWS public cloud migration. And we decided to do a full tech re-stack at that point, not a forklift. So the difference is, when you do a tech re-stack, you’re basically re-architecting everything you have, moving the products of everything you have, so we moved away from SQL Server to Redshift, for example. And that took us like 18 months to do that. So by summer 2019, we were pretty much all in AWS, and then we were able to turn off our on prem infrastructure at that point. And now, you know, we’re all in the cloud using the native services there. So we use things like EMR and Spark clusters and Parquet files, and S3 and Redshift, Aurora, Kinesis, you know, MSK, which is the managed Kafka service, Glue, CloudWatch, things like that. So that was a huge change for us. And it took us 18 months to do that. So, you know, it was painful, but it was worth it. And now, we’re able to support an environment where we’ve got around 600 terabytes of columnar compressed storage. So that’s, you know, 10-to-1 compression ratio right there. So if you tried to take that 600 terabytes and put it in a row store, you’d end up with, like, you know, 6,000 terabytes. 
So that’s, that’d be really hard to manage on prem, you know, in some kind of SQL Server, Oracle environment. But in the cloud, you know, I’m not too concerned about managing 600 terabytes. And then plus in the cloud, Amazon is managing a lot of data replication for you, you know, they’re doing patching and management stuff for you. So a lot of burden is on them. And that allows my team to focus just on building application logic and serving our customers. And I don’t have to worry nearly as much about what’s going on in the data center anymore.
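Editor's sketch (not part of the conversation): the storage arithmetic Alex describes, written out. The 600 TB, 10-to-1, and 185 TB figures come from the conversation; the function name is hypothetical, and the ratio is his rough estimate rather than a measured value.

```python
def estimated_row_store_tb(columnar_tb: float, compression_ratio: float) -> float:
    """Rough row-store footprint implied by a columnar size and a compression ratio."""
    return columnar_tb * compression_ratio


# Figures quoted in the conversation.
row_store_tb = estimated_row_store_tb(columnar_tb=600, compression_ratio=10)

# How many times larger that would be than the 185 TB on prem SAN.
growth_vs_on_prem = row_store_tb / 185
```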
Kostas Pardalis 11:56
Alex, it’s very, very interesting. It’s very exciting for me to have you here today, because you are one of these rare cases of people who have experienced both the on prem and the cloud solutions. And it sounds so far that you’re pretty excited about the cloud. And correct me if I’m wrong, but probably you prefer it and you find a lot of benefit in being deployed on the cloud instead of using an on prem infrastructure. Many people say that one of the benefits of having on prem deployment has to do with security and compliance and the control that you have. What’s your opinion about that? Do you think that this is actually something that’s a real concern? Do you think it is addressed right now by the cloud providers? Do you think that there’s still work to be done there? What’s, what’s your feeling about it?
Alex Lancaster 12:46
I think the security is fine in the cloud. And you know, at Intuit, we have a central security team, we have a data handling team, and they help the various PD teams, you know, set up their accounts in a certain way, and they have intuitized AMIs. So when you’re restacking your AMIs, they come bundled with all of the, you know, security that they want. And we have all the KMS keys and things like that, that’s locking down S3 buckets and encrypting data at rest the way we want. So I don’t see a problem with that. But at the same time, we have a central team of very smart people that have looked into the details of all this, and they’ve carefully architected things to a corporate standard, and we follow that standard. So, you know, to me, everything works awesome in the cloud, and I would never want to go back to the on prem way of doing things.
Kostas Pardalis 13:43
That’s great. Is there any kind of advantage that you still think that on prem has compared to cloud?
Alex Lancaster 13:49
If you’re small, maybe. I really don’t think so. I mean, honestly, I think that the age or the time of the on prem data center is quickly evaporating and going away. I don’t think … if you’re a new company, and you’re thinking about building infrastructure, to me, it makes no sense to do it on prem, just build your stuff on the cloud from the get-go. And maybe there are certain industries, or certain weird use cases that I haven’t heard of that you really need some kind of on prem supercomputer, or maybe you’re like a weather modeling place or something, and you need a crazy supercomputer. But I mean, these days, there’s so much variety and options in the cloud to do huge machine learning, huge modeling of data, and handling of many, many petabytes of data, very straightforward. So I just don’t see any advantage really to on prem anymore.
Kostas Pardalis 14:45
Yeah, makes sense. That’s very interesting to hear from you. Going back to the things that you mentioned a little bit earlier, during your conversation with Eric, where you mentioned about data marts. I mean data marts in data infrastructure is one of the last steps before the customer, the internal customer, you have the user of this data is going to consume it through like a BI tool or whatever other tools they have. Can you give us an overview of the architecture that you have today, at Intuit, I mean, the architecture of the data, infrastructure architecture that you have, and what kind of pattern you’re following. Is it something like a data lake, it’s more like built around the data warehouse? And let’s talk a little bit about this, because I think you’re going to have a very interesting case. And so you, you’ve done a lot of, let’s say, very thoughtful decisions around that stuff. So I think it’s going to be very useful for both me and Eric, and also like the people that are going to listen to the show.
Alex Lancaster 15:42
Sure, so we do have a central corporate data lake that is there. And we do pull data from that. And we also register, you know, our transformed files with the central Hive metastore so that it’s visible to other people that use the data lake. But we also have to pull from upstream transactional systems, and also streams to get data in our environment. So you know, we use EMR clusters to do query-based ingestion from some places, we use Kinesis streams and MSK to pull data from queues. There’s different latency requirements that we have. So the lake, for example, could be like a 24-hour kind of latency situation. And then if you try to pull from upstream, you know, transactional databases, maybe you’re running mini batches, and you’re pulling every two, three hours or something from them. And then if you have very low latency situations, you’re talking about streams, so like, you know, Kinesis streams, MSK, you can get data into your warehouse every 15 minutes, every 30 minutes, something like that. So we have all those use cases in play today. And I think one of the most important architecture things for me is, do your ETL outside of the database, right. So when we were in SQL Server on prem, you know, we were using SQL Server to do the ETLs. And we had lots of stored procedures and using SSIS, and all that. So all that’s gone away now. So we use EMR and Spark clusters, and we have several of them. And we can scale out our Spark clusters as needed. You know, we can use persistent Spark clusters, we can use transient Spark clusters as needed. And also Lambda functions, when you’re talking about streaming, you know, Amazon manages the infrastructure for our Lambda functions, and we can handle, you know, hundreds of thousands of messages a minute in that scenario. And then, you know, you do your transformations there and in Spark and so on, and then you write it back out to S3, for your final summary tables, right? 
So use Parquet in S3, and you can partition Parquet files, you know, huge Parquet files, right, that can be, you know, dozens of terabytes large in S3 with no problem. And then you just use the COPY command to load that into Redshift very quickly. So Redshift has a way to do parallel loads of Parquet files from S3 very fast, you know, most of them take seconds or a minute or so. And then Redshift just becomes like your, you know, serving layer at that point. So that’s sort of the main architecture overview.
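Editor's sketch (not part of the conversation): the final load step Alex mentions, Parquet in S3 into Redshift via COPY, might look roughly like the statement built below. The table name, bucket, and IAM role ARN are placeholders, not Intuit's, and the helper only assembles SQL text; it does not connect to anything.

```python
def build_parquet_copy(table: str, s3_prefix: str, iam_role_arn: str) -> str:
    """Assemble a Redshift COPY statement that loads Parquet files from S3.

    Plain string formatting is used for readability, so this is only
    suitable for trusted, hard-coded identifiers.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )


# Placeholder values for illustration only.
copy_sql = build_parquet_copy(
    table="risk_mart.payment_summary",
    s3_prefix="s3://example-risk-mart/summary/payment_summary/",
    iam_role_arn="arn:aws:iam::123456789012:role/example-redshift-load",
)
```

The resulting statement would be run against the cluster by whatever SQL client or scheduler the team already uses; the Spark job that produced the Parquet files stays outside the database, which is the point Alex is making.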
Kostas Pardalis 18:28
That’s very interesting, actually, you said something that I’d love to learn more about. You mentioned that it’s better to have your ETL logic outside of the database or the data warehouse, which is quite interesting, because I don’t know if you have heard of all this movement in the market from ETL to ELT, which is more of the paradigm of, let’s extract the data, load the data into the data warehouse, and then run any kind of transformations that we want inside the data warehouse, instead of doing it on the fly. So why do you believe that it’s better to have ETL outside? And what’s the difference? Like, what were the problems that you had when you were doing the opposite with MS SQL Server?
Alex Lancaster 19:11
Okay, so when you do the ETLs in the database, you are sort of boxed in or limited by that machine, right. So if you need to handle some giant ETL job with billions of records, you’re running that on your database. And when the data gets really big, you start to have problems with this approach. So when you take the ETLs out of the database, and you’re doing it in EMR with Spark, now, if you need a 50 or 100 node Spark cluster, for 30 minutes, whatever, to process some 50-plus billion row ETL, you can do it and it’s not gonna touch or hurt your database or affect the resources there. And then you use the big data Parquet format in S3 to store your transformation back out. And you can partition the Parquet file, you know, however you want, which is very useful. And then the COPY command works very well with Redshift to load the data in there. But at the same time, you can use your Parquet file in S3 to share your data back with the lake. Right. So what you do is you use a Hive cluster to register that table with a Hive metastore and then the lake becomes aware of your Parquet file sitting in your account, you don’t even have to move the data anywhere. It’s just a metadata entry in there. And then people can query the lake and see your Parquet file and query it right away. So you solve two problems, right? You’re solving the problem of sharing big datasets with like data science folks that want to use your output with like SageMaker and their own Spark clusters. And what they really want to see is Parquet files in S3. And then you solve the data warehouse use case with Redshift where people just want to use, you know, SQL to query it. And they have things like Tableau and Business Objects and Qlik Sense and so on, connected to Redshift, and that works well for them. And, you know, it’s just the quickest way to this scenario.
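Editor's sketch (not part of the conversation): registering existing Parquet files with a Hive metastore is a metadata-only step, typically an external table definition that points at the S3 location. The helper below builds such a DDL statement; the table, columns, partition column, and bucket are all hypothetical, and how Intuit actually registers its tables is not described in the episode.

```python
from typing import List, Tuple

Column = Tuple[str, str]  # (column name, Hive type)


def build_external_table_ddl(
    table: str,
    columns: List[Column],
    partition_columns: List[Column],
    s3_location: str,
) -> str:
    """Build Hive DDL for an external, partitioned Parquet table.

    The data files stay where they are in S3; only metadata is created.
    """
    column_sql = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    partition_sql = ", ".join(
        f"{name} {hive_type}" for name, hive_type in partition_columns
    )
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"  {column_sql}\n"
        ")\n"
        f"PARTITIONED BY ({partition_sql})\n"
        "STORED AS PARQUET\n"
        f"LOCATION '{s3_location}';"
    )


# Placeholder schema and location for illustration only.
ddl = build_external_table_ddl(
    table="risk_mart.payment_summary",
    columns=[("merchant_id", "BIGINT"), ("total_amount_cents", "BIGINT")],
    partition_columns=[("payment_month", "STRING")],
    s3_location="s3://example-risk-mart/summary/payment_summary/",
)

# With Hive-style directories (payment_month=2020-01/ and so on), existing
# partitions can then be discovered with: MSCK REPAIR TABLE <table>;
```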
Kostas Pardalis 21:14
This is great. I have another interesting question, at least for me, you’re describing a very like modern data architecture that you have deployed on AWS. And that’s from what I understand, like a pretty recent development. Right? You, I think you said that you ended the transition to the cloud in 2019, or something. Is this correct?
Alex Lancaster 21:35
Yeah, we were finished in summer 2019. And it took us about 18 months to do all of that.
Kostas Pardalis 21:41
Yeah. So this lake architecture that you have, did you have any part of this architecture also when you were on prem, or the architecture that you have there for your data infrastructure was completely different.
Alex Lancaster 21:54
So on prem, they had the Intuit Analytics Cloud, the IAC. It was a big Hive cluster, Hadoop cluster, and it was not very good. It was nowhere near what we have in AWS with the S3 data lake now. And it was always having like, space problems, and, you know, throughput problems and stuff like that. We just couldn’t operate it on the scale that we wanted to. And the lake really, in my opinion, the central data lake wasn’t truly realized until we got into AWS. And we got everything in S3, and everybody put Parquet files in there, and it became like this really usable, powerful thing at that point.
Kostas Pardalis 22:40
It’s really fascinating. How do you do that? How do you design this transfer from this on prem solution that you already have running operationally and that drives your business? And in 18 months, you have completely substituted this environment with something completely new, right? Because it’s not just that you changed your infrastructure, you re-architected the whole data infrastructure that you have. So what does it take, from an organizational point of view and from the engineering perspective, to do that? I’d love to hear more about how you did it successfully.
Alex Lancaster 23:22
So first, you have to make a decision about forklifting versus tech re-stack. That’s a key decision. Personally, I wouldn’t recommend people forklift what they’re doing on premise into any cloud and then try to duplicate what they’re doing on prem using virtual machines in the cloud. That’s really not what the cloud is designed for. And you can do it, it’s true, but you’re not going to get the result and the value and the benefit from the cloud that you could if you use the native services there. So we decided to do a full tech re-stack, we wanted to use all the native services in the cloud and really use the cloud for how it wanted to be used. And we wanted to get into Spark. We wanted to use the Redshift MPP, which is a managed service. And then instead of SQL Server, we use Aurora, we have a small Aurora database, that’s also a managed service. So that’s like the start of it. It’s that decision, forklift versus tech re-stack. And then, you know, there’s a lot of learning. That 18 months was painful, you know, we had a lot of learning and a lot of trial and error on things, but, you know, we had some architect people to guide us with decisions. We had technical account managers from Amazon to help guide us with certain decisions. So that helped a lot. And then, you know, you have to make sure that your manager and his manager and so on is on board with that, and, you know, your executive sponsorship is on board with that about what you’re doing and why you’re doing it. So you have to, you know, politically and, you know, program management wise, you have to communicate a lot about what you’re doing and why you’re doing it, timelines and so on. And then you got to get your customers to come along for that ride at the end, and convince them that you’re doing the right thing for the right reasons. So it’s a complicated thing. But at the end, I’m glad we did it this way. And for us, that tech re-stack decision was the right one. 
I see other teams who did not make that decision. They decided to forklift, and I see they struggle. They have all kinds of issues from doing that. And I’m just so glad I’m not on those teams.
Kostas Pardalis 25:38
Yeah, that’s, that’s amazing. I mean, congrats for successfully doing this project. I mean, for you, and the whole team that was involved in this, it’s really amazing. Because it’s not just I mean, it’s also amazing from an organizational standpoint, because there’s always resistance in change, and you’ve decided not just to change, but to radically change your infrastructure and the way that you operate. And that’s amazing. And it says something also about the culture in the company. One last question, before I let Eric continue with his questions. Is there a particular technology that became available to you after you migrated into the cloud, that you are really excited that you are using? It’s something that you consider as a game changer in your work?
Alex Lancaster 26:23
Yeah, I would say the streaming. So being able to use Kinesis streams with Lambda, or using MSK managed services for Kafka. That’s, that’s pretty huge. Because now you can get data in your warehouse like, you know, 15-minute latency, 30-minute latency, and, and handle huge throughput, right? So we can handle, you know, hundreds of thousands of messages a minute, with no problem. And Amazon is scaling out on the back-end handling all this crazy message infrastructure. That’s something that we just could not do on prem. And it’s exciting because people can, your customers can see what’s happening, you know, in production, you know, 15 minutes after it happens. And that just wasn’t really possible before, in a big data, you know, data warehouse situation.
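Editor's sketch (not part of the conversation): when Lambda is triggered by a Kinesis stream, each invocation receives a batch of records whose payloads arrive base64-encoded under `Records[].kinesis.data`. The handler below decodes such a batch, assuming the producers send JSON; the message fields and the `_fake_event` helper are hypothetical, and a real function would go on to transform the rows and write them out.

```python
import base64
import json
from typing import Any, Dict, List


def decode_kinesis_records(event: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Decode the base64 JSON payloads in a Kinesis-triggered Lambda event."""
    rows = []
    for record in event.get("Records", []):
        payload = base64.b64decode(record["kinesis"]["data"])
        rows.append(json.loads(payload))
    return rows


def handler(event: Dict[str, Any], context: Any) -> Dict[str, int]:
    """Lambda entry point; here it only counts the decoded messages."""
    rows = decode_kinesis_records(event)
    # A real pipeline would transform `rows` and land them in S3 or the warehouse.
    return {"decoded": len(rows)}


def _fake_event(messages: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build a minimal stand-in for a Kinesis event, for local testing only."""
    return {
        "Records": [
            {"kinesis": {"data": base64.b64encode(json.dumps(m).encode()).decode()}}
            for m in messages
        ]
    }


sample_event = _fake_event(
    [
        {"payment_id": 1, "amount_cents": 2500},
        {"payment_id": 2, "amount_cents": 4000},
    ]
)
result = handler(sample_event, None)
```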
Kostas Pardalis 27:15
That’s great. Eric, it’s your turn now.
Eric Dodds 27:18
Awesome. I was gonna say, thinking back, you know, you said 18 months. And, you know, that is a non-trivial amount of time. But my gut reaction to hearing that, especially now, after hearing more of the details of the migration, is that it actually sounds really fast for how fundamental of a shift it was technologically. So again, I’ll echo Kostas: congratulations on that. Because that’s a monumental effort in a relatively short amount of time for how much you changed.
Alex Lancaster 27:55
Thank you. I think it was worth it. So we’re happy with where we are now.
Eric Dodds 28:00
I want to, I guess, my whole thing with this episode I’m discovering about my line of questioning is about understanding sort of the course of your career. But I noticed that you were a software engineer before working in the data engineering space. And I’m just interested to know, you know, and you did software engineering at Intuit. How does that change your perspective on data engineering? And specifically, you know, or do you think that there are things that you experienced as a software developer that make you more valuable as a data engineer, especially with sort of the range of, or the scope of work you’re doing across the organization with all sorts of different types of data?
Alex Lancaster 28:44
So, yes, I was a, you know, software application developer for most of my career. And then right around the October 2010 timeframe, I left that software engineering team, and became part of the risk, you know, data warehouse, BI team, and I’ve pretty much been working in that space ever since. So, you know, as a software engineer, I was working with, you know, highly transactional systems. And, you know, data sets were mostly small. So you build like a business application or website or something like that. And you’re just dealing with lots of small transactions, and usually working with, you know, relational databases like SQL Server, Oracle, and so on. So I did that for a long time. And now I’m happy with what I learned in that space. And, you know, I learned kind of what the limitations are, although, at the time, you know, I didn’t think about their limitations. I just learned how that world worked and, you know, dealing with transactional DBs and getting good at writing SQL and stored procedures and learning how to, you know, tier your applications and those kinds of things. But I think just around October 2010, I just got more interested in the data warehousing space and I started working on that. And that’s when I started to, you know, build the risk data mart. And I’m still happy in that space. It’s more on the back end, of course, but it’s just dealing with different kinds of problems. And the data is much bigger, and the problems and scenarios are different. So it kind of felt like a new job in many ways, and kept my interest up in the space. But I think it, you know, really helped that I know the front end, as well as the back end, and kind of what the pain points are in the front end and understanding what they’re about. I think that helps me deal with the back end stuff and be sympathetic to those things.
Eric Dodds 30:45
Yeah, absolutely. It gives you having been in sort of the shoes of someone who’s doing a certain job that has an output that you deal with, gives you an I’m just thinking back on experiences I’ve had where it just gives you a lot more empathy, you know, in terms of dealing with some of the issues that come with data, which is always messy, you know, in some form or fashion, and always requires some level of cleansing. So Intuit’s in the financial space. So you deal with sensitive information? Could you talk through how that impacts your work in the data engineering space? I mean, you talked a little bit about the security in the cloud. But finance is one of the most highly regulated industries there is and dealing with that data I’m sure presents pretty particular challenges. I’d just love to hear about what some of those challenges are, and then how you deal with them, as you know, as a data engineering manager.
Alex Lancaster 31:48
Sure. So the group I work in is mainly in the money movement space. So this is things like payments, payroll, QuickBooks Capital, you know, moving money around, dealing with, you know, card entities like Visa, MasterCard, Discover, Amex, PIN debit, ACH, those kinds of things. And, you know, it’s a lot of parallels with being a bank. So I always kind of remind people that Intuit is almost like a bank, but not quite a bank. So there’s a lot of things that we do. Yeah, we need to do things, you know, that are very common with a big bank. So you have all kinds of compliance issues that come into play. So like, you know, PCI compliance, SOX compliance, you know, for tax, they have 7216 compliance, and you deal with entities like the Office of Foreign Assets Control, NACHA, FinCEN. If there’s big fraud events, you know, we have contacts with the FBI, and so on to help us deal with fraud attacks. And so, there’s a lot of regulations and stuff that you have to deal with. And that’s not fun, but it’s necessary. And also, you know, encrypting data in transit, encrypting data at rest, and dealing with keys, how you’re handling keys, how you’re handling sensitive fields, all of these things are important. And there are central teams that help the PD teams, you know, deal with the stuff and make the right decisions and make sure their account is set up in the right way. And that they’re using the keys properly. And, you know, they understand, you know, two-way encryption or hashing and stuff properly. And so, you know, there’s a lot of guidance and help in that space. But yeah, it is very similar to banking. And when you move a lot of money around, there’s a lot of risk that comes. So fraudsters are always trying to, you know, attack the system and create fake accounts and launder money. And, you know, so it’s a big kind of soup of issues that you need to deal with on a daily basis. But it’s a fun, fun space to work on.
Eric Dodds 33:57
Yeah, I mean, I’m sure. You have to solve all sorts of interesting problems, you know, as the entire world has gone digital. One question: we had a guest on recently who worked in data science in the healthcare space. And he talked a little bit about some of the challenges he faced in a very highly regulated industry building models with, you know, PII or sensitive data. I know you’re not on the data science team, but it sounds like you deliver data products to them, you know, or collaborate closely with them. Is there anything on the data science side in terms of dealing with financial data or sensitive data that presents particular challenges?
Alex Lancaster 34:43
Yes, you know, so I think that they can use hashed fields for a lot of things. So instead of having like a full tax ID in the clear, they can use a hashed value of that, for example. So yeah, I’m sure there are issues when they do their featurization, and they’re coming up with, you know, which features are going to be, you know, more powerful than others and more influential than others. They have to use, you know, what’s available to them. So some of the things they can do is have a real-time model, for example, in line with a transaction or an onboarding event. And in that case, they have access to data as it’s coming in for a transaction, and they can see things that you couldn’t otherwise see, like in the data lake, for example. So for those kinds of, you know, real-time models, they’re able to do some fancy stuff there and have access to data that wouldn’t be normal to have access to in the lake. And then for a batch model, for example, they can run huge batch models for portfolio analysis or whatever, on lake data or data that we have in our S3 bucket. And then those models might use like, you know, hashed values for sensitive fields, for example. So I think they get around it. But there is definitely a big difference between, you know, batch machine learning and real-time machine learning.
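Editor’s note: the idea of handing data scientists a hashed value instead of a tax ID in the clear can be sketched in a few lines. This is a minimal illustration, not Intuit’s implementation: the episode does not say which hashing scheme is used, so the keyed SHA-256 (HMAC) below, the function name `hash_sensitive_field`, the field names, and the key are all assumptions made for the example.

```python
import hashlib
import hmac


def hash_sensitive_field(value: str, secret_key: bytes) -> str:
    """Return a keyed SHA-256 digest (hex string) of a sensitive value.

    A keyed hash is used here (an assumption) so that the digest cannot be
    reproduced by someone who knows the value format but not the key.
    """
    # Normalize formatting so "12-3456789" and "123456789" hash identically.
    normalized = value.strip().replace("-", "")
    return hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key for illustration only; a real key would come from a
# key management system, never from source code.
key = b"example-key-not-for-production"

record = {"tax_id": "12-3456789", "amount": 250.0}

# Replace the clear-text tax ID with its hash; other fields pass through.
safe_record = {**record, "tax_id": hash_sensitive_field(record["tax_id"], key)}
```

Because the same input always produces the same digest under a given key, the hashed column can still be used for joins, grouping, and counting in a batch model without exposing the original value.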
Eric Dodds 36:06
Very interesting. Okay, one more question, because we’re getting close to time here. And, you know, we talked to all sorts of different people, but it’s really fun to talk to data engineers, because we get to ask all sorts of specific questions about data. So you’ve talked about multiple different types of data, just in answering some of the other questions. So you know, Parquet files, etc. But I’d love to know the breadth of the types of data that you and your team deal with. And then if there are any particular types of data that sort of present unique challenges for you as you’re managing it. I mean, how many pipelines do you manage? It seems like a huge amount.
Alex Lancaster 36:45
So we have, you know, well over 1,000 jobs in our environment. And then for streams, you know, we have a couple dozen streams going. So yeah, it gets complicated. And then there’s dependencies, right. So if you have, you know, 1,500 or 2,000 jobs, whatever it is, certain jobs need to execute before others. So there’s a complex dependency web that needs to be managed there. And so we need to take care of that too.
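Editor’s note: the “complex dependency web” Alex describes is, in general terms, a directed acyclic graph of jobs, and a valid run order is a topological ordering of that graph. The sketch below uses Python’s standard `graphlib` module (Python 3.9+) to show the idea. The job names are hypothetical, and the episode does not say which scheduler or orchestration tool Intuit uses.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs: each key maps to the set of jobs that must finish first.
dependencies = {
    "load_payments": set(),
    "load_payroll": set(),
    "build_merchant_summary": {"load_payments", "load_payroll"},
    "score_risk": {"build_merchant_summary"},
}


def execution_order(deps):
    """Return one valid run order in which every job follows its prerequisites."""
    # static_order() raises graphlib.CycleError if the jobs depend on each
    # other in a loop, which is itself a useful check on the job definitions.
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

With a thousand or more jobs the same structure applies; a production scheduler would additionally run independent jobs in parallel and handle retries, which this sketch leaves out.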
Eric Dodds 37:15
Sure. And so in terms of the types of data, the various sets of data that are flowing through those pipelines, I’d love to know just some of the major ones to understand the breadth of different types you’re dealing with.
Alex Lancaster 37:30
So the standard in the data lake is Parquet, right. And Parquet is nice because it includes the schema in the header of the Parquet file, so you can look at a Parquet file and natively understand the schema and the data types in there. And then it’s partitioned into, you know, many files in S3. So it’s easy to read that way. And you can read specific partitions if you want. So that’s the data lake standard. But if you’re doing messaging or streaming, usually JSON format messages are common there. And some of those can be, you know, pretty simple and trivial. Others can have deep nesting and be kind of complex. So you have to be, you know, adept at parsing JSON out to do whatever processing or flattening that you’re trying to do from a data warehouse perspective. And then we don’t have to deal with fixed field formats, you know, much anymore; there’s usually the upstream teams dealing with that. So, like Experian ARF format, for example, can be fixed field. So some of the older mainframe systems that data vendors use, they might have, you know, fixed fields, or CSVs, and things like that. But that’s much more rare. And then we don’t have to deal with any images or audio-video data as of yet. So I haven’t had to deal with that part.
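Editor’s note: flattening a deeply nested JSON message into warehouse-style columns, as mentioned above, can be sketched with the standard library alone. The `flatten` helper, the dotted column naming, and the sample message are all hypothetical and chosen for illustration; they are not taken from Intuit’s pipelines.

```python
import json


def flatten(obj, parent_key="", sep="."):
    """Flatten nested dicts and lists into a single-level dict.

    Nested keys are joined with `sep`; list elements use their index.
    Empty dicts and empty lists produce no columns.
    """
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            new_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
            flat.update(flatten(value, new_key, sep))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            new_key = f"{parent_key}{sep}{index}" if parent_key else str(index)
            flat.update(flatten(value, new_key, sep))
    else:
        # Scalar leaf: record it under the accumulated column name.
        flat[parent_key] = obj
    return flat


# Hypothetical streaming message with nesting and an array.
message = (
    '{"event": "payment", '
    '"card": {"network": "visa", "last4": "4242"}, '
    '"items": [{"sku": "A1", "qty": 2}]}'
)

row = flatten(json.loads(message))
print(row)
```

Indexing list elements keeps every value, but it yields a variable number of columns per message; a pipeline that needs a fixed schema might instead explode arrays into separate rows.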
Eric Dodds 38:47
Yeah, we had someone from Netflix on as a guest, and it was pretty fascinating to hear about them dealing with, you know, audio-video data, because it’s, I mean, it’s pretty heavy duty, you know, when it comes to file sizes, etc. I know, I said only one more question. But here’s a quick follow up. What challenges do you face in data types? Like, is there something that you find you constantly have to deal with or sort of has required you making changes in the pipeline or addressing?
Alex Lancaster 39:21
Well, I think you have to be good at detecting problems upstream. So sometimes the upstream systems, they’re not really aware of or nice to the downstream systems, and they can make breaking schema changes, they can change data types in the middle of a table. Sometimes they change the meaning of fields, and they don’t really think too hard sometimes about the downstream implications of that. So that’s a challenge. Also, the upstream systems may not be aware that the data can still be there for like 20-30% of the time, but then, you know, they did a code change and now, you know, the other 70% of the time, the field is not populated, and they may not notice that right away. But when you do, you know, aggregations and pivots and stuff like that, that kind of problem pops out very prominently. And you can see big drop-offs in field population. So sometimes we have to, you know, tell the upstream system, hey, you know, what happened with this field? You know, on Tuesday, it was populating 99% in here, and Wednesday, we only see 70%. What happened? You know, sometimes it’s news to them, so you just have to be sort of prepared. You have to be good at detecting problems with the upstream systems and problems in the lake, too. So there’s different techniques for that.
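Editor’s note: the Tuesday-versus-Wednesday check Alex describes amounts to computing, per day, the share of rows in which a field is populated and flagging a large day-over-day drop. Below is a minimal sketch under stated assumptions: the function names, the `mcc` field, and the 10-point threshold are hypothetical, and Alex does not describe the specific techniques his team uses.

```python
def population_rate(rows, field):
    """Share of rows in which `field` is present and non-empty."""
    if not rows:
        return 0.0
    populated = sum(1 for row in rows if row.get(field) not in (None, ""))
    return populated / len(rows)


def find_dropoffs(daily_rows, field, max_drop=0.10):
    """Compare each day with the previous one and report large drops.

    `daily_rows` maps a day label to that day's rows, in chronological order.
    Returns a list of (day, previous_rate, current_rate) tuples.
    """
    alerts = []
    days = list(daily_rows)
    for previous_day, current_day in zip(days, days[1:]):
        previous_rate = population_rate(daily_rows[previous_day], field)
        current_rate = population_rate(daily_rows[current_day], field)
        if previous_rate - current_rate > max_drop:
            alerts.append((current_day, previous_rate, current_rate))
    return alerts


# Synthetic data mirroring the example in the conversation:
# 99% populated on Tuesday, 70% on Wednesday.
tuesday = [{"mcc": "5411"}] * 99 + [{"mcc": None}]
wednesday = [{"mcc": "5411"}] * 70 + [{"mcc": None}] * 30

alerts = find_dropoffs({"tuesday": tuesday, "wednesday": wednesday}, "mcc")
print(alerts)
```

A fixed threshold is the simplest possible rule; a team monitoring many fields might prefer to compare against a rolling average so that normal day-to-day variation does not trigger alerts.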
Eric Dodds 40:41
Sure. All righty. Well, we are at time here. Alex, it has been really fascinating to hear about all the incredible work you’ve done at Intuit, and I know that our listeners will really appreciate the insights that you provided, especially around handling major migrations. So thank you again for your time and for teaching us so many great things.
Alex Lancaster 41:07
Thank you very much for having me on today. Appreciate it.
Eric Dodds 41:11
All righty. That was absolutely fascinating. I mean, I think one of my big takeaways is that Alex manages 1,000 pipelines, which is kind of mind boggling to me. That sounds … I am getting a little bit stressed just thinking about that. What stuck out to you?
Kostas Pardalis 41:33
Well, I think monitoring 1,000 pipelines is nothing compared to re-architecting and re-deploying everything from on prem to the cloud in 18 months successfully. Right. That was, that was insane. I think they are very, I mean, he was very modest and very cool about it. But the team and the company, I think it’s also like a big success for Intuit and the culture they have. This kind of radical restructuring of such an important thing and complex thing as a data infrastructure in 18 months, like it’s, it’s insane. I found it extremely interesting.
Eric Dodds 42:10
Me too. Yeah. Alex is so calm. He seems like the type of guy you would want behind the wheel of a huge project like that, because it doesn’t seem like a lot ruffles his feathers. All right. Well, thanks for joining us on The Data Stack Show. Subscribe to get notified of new episodes on your favorite podcast service and we will catch you on the next one.