Episode 75:

How To Become a Data Engineer with Parham Parvizi of the Data Stack Academy

February 16, 2022

This week on The Data Stack Show, Eric and Kostas chat with Parham Parvizi, founder of the Data Stack Academy as well as a few large SV companies. During the episode, Par unpacks a career in data engineering from the history to signs you’d be a good fit.


Highlights from this week’s conversation include:

  • Par’s background and current role (2:48)
  • About Talend (6:46)
  • Nonlinear pathways to data engineering roles (11:08)
  • What a data engineer needs to be successful (17:37)
  • Before “data engineer” was a title (27:59)
  • Signs you should be a data engineer (32:39)
  • Curiosity and data engineering (38:31)
  • Defining the modern data stack (45:07)
  • How to get a feel for data engineering (52:52)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.


Automated transcription – may contain errors

Eric Dodds 00:06
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com. Welcome to The Data Stack Show. Today we are going to talk with Parham Parvizi. He has a long history working with data. In fact, he was one of the first couple of people at Talend, which has been around for a really long time. And then he actually brought the Hadoop and Spark infrastructure into Talend and really influenced the shape of the product, which is fascinating. And today, he runs a consultancy and also a school that teaches data engineering. It’s called the Data Stack Academy. Maybe we can find out if he got the name from us. But I’m really excited to talk with him because I have a background in education. So I think one of the things that really interests me about Par’s school for data engineering is: what are the key foundational principles or tools that he thinks a data engineer needs to have to build a foundation for a career? Because, you know, it can be hard to distill that down. So that’s what I’m going to ask about. How about you?

Kostas Pardalis 01:27
Yeah, first of all, I want to ask him about the evolution of data engineering. He’s been around for a long time, and he has been around since back when we didn’t even use the term data engineer, right? So it’d be great to hear from him what the evolution has been, and how things have changed from whatever was happening in the early 2000s to today. And also, of course, to hear his opinion about what’s going to happen in the future, right? So that’s definitely something that I would like to discuss with him. And then, yeah, it would be great to discuss a little bit more about the technologies and the details of what it takes to be a data engineer and how it feels to be a data engineer. Is it fun or not?

Eric Dodds 02:13
Let’s find out and go talk with Par. Par, welcome to The Data Stack Show. We’re super excited to talk with you.

Parham Parvizi 02:20
Thanks for having me. Thanks for having me, Eric. Super excited to be here as well.

Eric Dodds 02:24
We’ve talked with lots of your friends as guests on the show. So it’s only appropriate that we can have you on as well. Can you give us your background and tell us what you do today?

Parham Parvizi 02:36
Oh, thanks, thanks. Yeah, I feel like I’m already a cousin of the show. You guys have had, like, every one of my friends in the industry, you know. Oh man, my background, it’s been a while. I think I’ve just gotten lucky and continue to get lucky throughout my career and my life. I was very lucky to be born in a house where my father was a civil engineer, but he really had a knack for computers, so we had, like, an IBM 8080 computer, you know, when I was like four years old, and I would load up my floppy disks and play my games. And one time I actually formatted our entire C drive, and that was not good. I don’t know if you remember DOS, but if you do format and don’t specify the drive, it defaults to C, which is horrible design, right? And then, yeah, I actually went to school for computer engineering. So I have a background in engineering and computer science, and I worked as a chip designer for a little bit. But I would say I really got started when I left hardware engineering and came to software engineering. I got connected with a company that you all might know, a very large company now, Talend. But I got very lucky: when I got connected with them, there were just three or four people in the US. So again, day one, my laptop was on my lap, and I was the only technical employee at that point, and I pretty much grew with them, you know, as they grew in the US. I trained a lot of folks and built a technical team around Talend myself, you know, around North America, and then we moved to Asia and other markets. I got lucky again at some point in the middle there: one of our account representatives was like, hey, there’s this thing, Hadoop, I keep hearing about, can you just figure out what this is over the weekend? So I went in and downloaded it and finally got it to compile and work without errors, you know, fairly early on, and then I saw the value of it immediately.
I was like, wow, this is the next thing. Either I’ve got to learn this or I’m going to be obsolete. So I started learning it, started contributing a little bit, and then later on I actually got connected with some of the founders of Hadoop, Doug Cutting and all those guys, at meetings, and that was just incredible. And being in that market, I moved to a company that at the time was called Greenplum; they evolved into Pivotal. I was one of their big data Hadoop solution architects, so I was part of the elite team they would send out to fix problems, you know, where there were large, like 1,000-node clusters, and it was: all right, here, you have two days, go figure out why this is not working, or optimize this ETL process that they have, see what’s wrong with it. We did that with some guys, you know, like Don Miner, that you guys might know; we were part of that team. And then in 2017 I decided to go and build things on my own, the dream that I’ve had, so I started a consulting company. That consulting company is still something that I run, but just about a year, year and a half ago, it dawned on us, because I work with a lot of people who’ve been in this industry for a while and, you know, we’ve kind of been around the block, and we’re like, oh, there really aren’t too many resources around learning to become a data engineer. So we decided to develop a curriculum around that. And, you know, it was just us working at nights, after hours, for a while, and then it became more serious. And we just launched this program, which, actually by coincidence, is called Data Stack Academy. So it’s a boot camp to teach people how to become a data engineer and the skills necessary for that.

Eric Dodds 06:26
Well, that is super exciting. And I want to dig into what I see as a nonlinear career path into data engineering, maybe as you compare it with, you know, software engineering and getting a computer engineering degree. But really quickly, if we can: Talend has been around for a long time. It predated a lot of the big trends that we’ve seen even over the last decade. So could you give us just a little bit of history, especially for the listeners who may not be as familiar with Talend? When you started with them, what were they building? And that was back in the mid-2000s?

Parham Parvizi 07:12
Yeah, 2006. I think it was by 2006 that I got connected with them. Yeah. And I think it was still v1 back then. Yeah. Wow. So originally, I mean, the field comes from data warehousing, right, and you take data warehousing and break it into the pieces that it has. And some of those are things that we don’t hear too much about anymore, like databases and FTPs. Those have kind of gone away a little bit, right? But then the other piece of that, of course, was the BI, the reporting, and that’s still there. Of course, you know, everybody has to do BI. And one of the other spokes of that wheel was ETL, you know, the data engineering, and back then I think it was mostly known as extract, transform, load, or ETL: the process of getting the data to the point that it is ready to be analyzed or viewed, you know, by the BI tools. And Talend was an ETL tool. So in ETL, there’s a lot of things that you do that, I would say, are mundane, right? It’s everyday stuff that can be templated. Like how you open a connection to a database, how you write to a database, how you read a file; those things can easily be templated in a data engineering job. So in Talend, we had made a very nice visual tool that let you use those connectors to different data sources and data sinks, and drag and drop, a couple of clicks, and you’re already there. It was a code generator; I think that was really the differentiator of Talend early on, because a lot of ETL tools on the market, like Informatica and so forth, had an engine behind the scenes. So, you know, you had to know their own language, and they had their own engine, but the Talend engine was just Java, and it generated Java code. So it also allowed you to do a lot of things that are not common. You know, all the ETL tools would get you like 70% of the way there, right? For that other 30%, you still need to code, and you need to do something custom.
So it allowed developers to do that, but to do the mundane things, like just opening a database connection, all of that stuff, very fast, visually. And then there was that shift again, I think, where the engine was quickly becoming MapReduce, Hadoop, and Spark. And that was actually my sort of design; again, to be very humble, but I was the person who started that kind of development at Talend. So I made the very first Spark and Hadoop connectors, and I was like, no, even Java is not going to suffice anymore. We need to go to something that’s highly distributed, highly parallel, like Spark and MapReduce can be. So then we changed the connectors to produce that code versus just a pure JAR file.
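
[Editor’s illustration] The templated steps Par describes (opening a database connection, reading a file, writing to a database) can be sketched in a few lines. This is a minimal sketch only, written in Python rather than the Java that Talend generated; the sample data, the table name `rides`, and the function names `extract`, `transform`, and `load` are all assumptions made for illustration and do not come from Talend or from the conversation.

```python
import csv
import io
import sqlite3

# Hypothetical sample input; stands in for a file read by a source connector.
RAW_CSV = """ride_id,city,fare
1,Portland,12.50
2,Seattle,30.00
3,Portland,7.25
"""


def extract(text):
    """Read rows from CSV text (the 'read a file' step)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Cast types and normalize values (the custom part a developer still writes)."""
    return [(int(r["ride_id"]), r["city"].upper(), float(r["fare"])) for r in rows]


def load(conn, records):
    """Write records to a table (the 'write to a database' step)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS rides (ride_id INTEGER, city TEXT, fare REAL)"
    )
    conn.executemany("INSERT INTO rides VALUES (?, ?, ?)", records)
    conn.commit()


def run_pipeline(text):
    """Open a connection, then run extract -> transform -> load."""
    conn = sqlite3.connect(":memory:")  # in-memory database for the sketch
    load(conn, transform(extract(text)))
    return conn


conn = run_pipeline(RAW_CSV)
print(conn.execute("SELECT COUNT(*) FROM rides").fetchone()[0])
```

The `extract` and `load` functions are the kind of boilerplate a visual tool can template, while `transform` is where the custom work Par mentions would usually go.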

Eric Dodds 10:07
Yeah, very cool. It’s really fun to get those little anecdotes of history in the world of data that we live in. I especially enjoy, and I hope our listeners do too, comparing the things that, you know, are just wildly different with the things that are like, well, I mean, some of that sounds functionally pretty similar, you know, to some of the modern tooling, which is really fun. So, thanks for sharing that. Okay, let’s talk about the role of a data engineer, because, I mean, you know, I would guess that a ton of our listeners have some sort of data engineering or data engineering related role. So it’s certainly not a new term, but on the show, I don’t know if we’ve actually ever stepped back to define the term. And that can be a little tricky sometimes, when a term is super familiar but we don’t actually put a sharp point on it. Let’s start with the nonlinear pathways that people travel into data engineering roles. Would you say that happened for you? I mean, you started out as a hardware engineer.

Parham Parvizi 11:13
Absolutely. You hit on that too: almost every data engineer that I know probably has a different background. And we’ve all had that. Yeah, some people come from a sort of data science, software engineering background, you know, traditional, like, a four-year degree followed by a master’s degree. Some people come from the business intelligence background, where they transform from being a business analyst. And some folks are just self-taught; they come from completely different backgrounds that I’ve seen, you know, even from being a server or a bartender, and they learn everything, all the tricks, to do so. So it is very interesting. And it’s kind of part of being a data engineer, because, as you said, that role is not very well defined, and in tech we’re kind of a jack of all trades and master of none. We know a little bit about a lot of different technologies, but we’re not really masters of anything. And that kind of defines what a data engineer does. A data engineer is like glue within a company, right? It’s the person who brings data from all the different sources where data exists within a company and meshes those together. So you do have to be connected with all the different arms of a company, you have to understand what those do and what that data means, and you have to be able to bring it together. And that makes you kind of very essential to that company, in the sense that you actually have to know even the business of the company, what the company is, so you know what that data is and how to treat it. But you also have to know, technically, how to grab the data from the different, you know, applications where it is stored and so forth.
But to give you an example, I like to maybe start with what a data engineer does at something that we might all know or use, like Lyft, the Lyft app, you know, a rideshare app. Like, what does a Lyft data engineer do, right? And I promise you that all of us have interacted with data engineers, right? They’re always in the background. They’re always there; you just might not hear exactly what they do, because we’re background people. We’re not developing apps, right? We’re not developing, like, the web or the things everybody uses. So in the Lyft example, you know, a company generates data. So Lyft has people who go download the app, use the app, and that generates a lot of data. There’s a lot of data as far as you getting the ride. There’s a lot of data as far as the ride itself, like your GPS updates, where you go, all of that stuff. And the app itself is usually developed by your full stack developers, app developers. But it’s the data engineer’s job to work with those folks to grab that data, and then store it in a manner that can be analyzed, and move it, probably to a cloud, move it to the servers that it needs to be on. So the data engineer works with them very directly to do that data acquisition, right? But it’s also the data engineer’s job to hand it off to other folks, like data scientists and business analysts, who are the business side of that, you know, who do something useful with that data. For example, in data science, an example in the Lyft case could be a prediction algorithm, where when I go to grab a ride, maybe the app tells me, hey, based on the patterns that we’ve seen, you might want to wait 15 minutes; it’s going to be cheaper for you to get this ride.
Or maybe it’s like a notification, maybe it’s an anomaly detection, like, okay, as you’re going through the ride, the app pops up and says, hey, your driver just seems to take too many wrong turns, right? Like, based on GPS data, we’ve seen that you’ve taken too many wrong turns; you might just want to know that. And those are the data science or machine learning algorithms, but it’s, again, the data engineer who provides the data scientists with all that data, and again grabs the results from the data science algorithms and provides them back to the user. So it’s that glue role again. And even in another sense: data scientists typically work with a smaller data set, right? They work in a prototyping fashion, or they may only develop a data science model looking at a population of 1,000 or 10,000. It’s the data engineer’s task, again, to really operationalize those in a company, take those to the masses. Okay, now we take this data science model in this prototyping phase, but now we’re going to scale it to the million or billion users that this company has, make sure that data continues to come through the pipeline, come through the system, and all the users get the results, right? So that’s the data engineer role. Also, data engineers, you know, we’re also the glue between all the different applications, so between your CRM application, sales application, all of that stuff. We play that role as well.

Eric Dodds 16:34
Super fascinating. I would love to hear from our listeners: if you’re listening to the show, write in and tell us if you have a similar opinion, because we haven’t defined it before. But that was super helpful, and I love using the example. One question I have, actually, for you, Par, and Kostas, because I think you both have, like, hardware engineering and software engineering backgrounds. In software engineering, and, you know, this is something I learned from being in an education business that focused on software engineering for a while, our instructors were always very insistent on teaching core principles, because that was way more important than, you know, the specific syntax of a particular language, right? Like, if you understand these core concepts of building software, you can apply them to different, you know, syntax within the context of a different language. Would you say that’s true for data engineering as well? Are there some foundational things, concepts, skills, that you need to master to have a really good foundation to build a long-term, successful career as a data engineer?

Kostas Pardalis 17:53
Yeah, that’s a really interesting question, and it’s a question, to be honest, that comes up not only for data engineering but also for software engineering. What you’re describing, Eric, is also the difference between computer science and computer engineering, right? Which are completely different disciplines, to be honest. I mean, of course, they have overlaps, right? And if you are a computer scientist, you can also become a computer engineer. But there is a reason that one thing is called science and the other thing is called engineering. And one of the things that happens a lot with the curriculums for software engineering is that they are heavily dominated by computer science topics. That’s my experience also, right? My degree is in electrical and computer engineering, but anything that had to do with the computer side was more computer science, right? I had to prove the complexity of algorithms, right? It wasn’t enough just to know the complexity of an algorithm. And I had to figure out what tools you can use to do that, and why we have, for example, these different complexity classes and all these things. Now, there’s a debate there: how much of that do you need to become a software engineer? Right? I’m biased, obviously. I like the science side of things, so I think that it’s important to know that also when you move into engineering. But the engineering track also has some very, very important elements that you need to know if you are going to build products and put technology into, let’s say, a form that makes it useful for people, right, everyday people.
And that’s, I think, what we are seeing happening right now in data engineering. I think the biggest change that has happened, probably in the past one or two years, is that all the principles that we have in software engineering or computer engineering, you see them being applied in data engineering and in how we deal with data, right? Quality, QA, tests, version control, all these tools that software engineers use every day to deliver what they have to deliver, we have started using them now for data. Now, are there some kind of principles that are, let’s say, fundamental? I would say it’s really the same thing as software engineering. I mean, it has the same principles. I don’t see some huge difference there in terms of how someone should shape the way they think in order to solve the problems there. In the end, these are, again, engineering problems, and they’re problems that you’re solving with software. And either you build software or you’re buying software. And I think that’s the main, very interesting characteristic of data engineering, and as I’ve said many times in the past, it’s a hybrid between ops and software engineering. Maybe in the future, I don’t know, we’ll have data ops and data engineering, and they will be two separate roles, I don’t know. But for now, I think the data engineer has to do both. So yeah, I know it wasn’t the most straight answer. But in my mind, at least, I don’t see any difference in the fundamentals between the two, between software engineering and data engineering, in the end.

Parham Parvizi 21:39
I want to bring it back a second and say how big data engineering actually is. And it’s not just us; there’s a lot of numbers behind this. I know there was a Dice report from the pandemic year where data engineering was the fastest growing field in tech. And they analyzed 6 million job postings in the US for that. And it rose by 50%. It had a double-digit lead over the second on the list, which was data science, at 32%. So, I mean, that’s huge. That tells you everything: this is the fastest growing field in tech, and that shows in the salaries. So when you look at the salaries on Indeed right now, the average data engineer salary is 119k; that’s higher than a data scientist, higher than a full stack developer, higher than a software engineer. And I want to say that, in some sense, we are very privileged to be data engineers, right? We have very comfortable jobs, and we do have salaries like that. One of the other stats in that post is unlimited time off for data engineering, which I don’t even know what that means, but I kind of do, right? We’re very flexible. We work very flexible hours, we have great 401(k)s, we have great healthcare. And some of those are just not available to a lot of us Americans these days. And I want to take that conversation, in that sense, to the folks that are trying to get into this market, right? What path would they choose? What do they need to learn to be a data engineer? And it is true, there’s a lot of different paths that you could take. And even if you’re in software engineering and you want to level up to be a data engineer, especially nowadays a cloud data engineer, I think that’s very key. What do you have to do? I do feel like, from my experience, there are some things that I can say would work. Yes, there might not be a general path.
And if you look at a lot of colleges out there, they’re just beginning to have, like, a master of data engineering program. There’s some boot camps now that do a specific data engineering career track. But here’s what I think would really work as the path. I cannot emphasize enough the importance of being a cloud data engineer these days. And if you’re on this self-learning path (of data engineering), start by actually looking at one of the cloud certifications. All the major cloud vendors, AWS, you know, Azure, GCP, now have a data engineering certification on their website, and they actually list a lot of good, free resources that you can go to in order to learn the skills to be that person. I would also say it’s very important, and it’s something that gets missed, that as you’re doing things, as you do projects, you do those on GitHub. Your GitHub profile, right, is like the most important thing out there. A lot of companies, when they hire, go on GitHub, and that’s how they vet the candidate. I do that myself. Code doesn’t lie, right? At the end of the day, you can look at someone’s GitHub profile and see what they’ve done. The other thing I would say is to get your hands on a lot of real-world projects. So there’s sites like Kaggle and DataHub.io, where you can go get a lot of these real-world data sets, real-world projects, and then build those, and build those on GitHub, and be able to show that, you know, to employers later on. I would also say, I mean, there’s tons of resources out there as far as learning, you know, DataCamp, and, you know, Udemy, Coursera, all those courses that have, I would say, very budget-friendly programs around that. But then there’s also this other kind of learning path, which is your boot camps. And there are now just starting to be some boot camps around data, and specifically around data engineering.
And they can provide some things that the other programs can’t, I think. One, they provide this, like, declared intention; that’s by far the most effective. When you say you want to be a data engineer and you go through these boot camps, you sign yourself up for this three- or four-month experience where you submerge yourself, immerse yourself in learning. And there’s something magical that happens when people have that declared intention, right? They’re like, okay, I’m going to do this. And second, I think the benefit they have is around that shared learning, right, when you’re around other people that are just at your own level. And you can maybe replicate a lot of that out there on your own, too. You know, Twitch is a great platform. Now a lot of people kind of learn together on Twitch; you know, somebody says, hey, on Wednesday night I’m going to go learn this tool, and if you want to join, join me, we’re all going to do it together. And that’s a really great resource; I highly recommend that. But I think, lastly, some of these boot camps have career services, and that’s something that you probably won’t be able to get anywhere else. And that would be very important, especially if you’re new to tech, to have somebody actually advocate for you, somebody to go and show you the ropes of where to get a job and how to get a job. That’s going to be really important. That’s sort of my spiel on the path to data engineering. And again, I do want to really emphasize the importance of learning a lot of these skills on the cloud.
And I see that’s where the tech industry is going. I do have a very opinionated view on the sort of five tools that you have to learn as a data engineer, because I know if you go out there and try to research on your own, you would immediately become very confused by all the different opinions that people have on the stack that you have to learn, because we all know that, as data engineers, we’re also very opinionated. That’s for sure.

Kostas Pardalis 27:55
100%. Par, I have a question. What was there before the title of data engineer? Like, companies have had problems with data since forever. They had to manage data, in a different way, obviously; we didn’t have cloud back then, and data warehouses were bundles of hardware together with software. But what was there before that, like when you were at Talend, right? Back in, like, the 2000s. What did you call these people?

Parham Parvizi 28:25
That’s a great, great question. Yeah, I think it’s kind of evolved with the different technologies, right? So again, if you look at it, it was software engineering, and that kind of evolved to data warehouses, right? And that kind of evolved to data lakes, and now it’s going to data clouds. And if we backtrack that, I think there are some skills that we’ve had to pick up along the way, right? We went from data warehouses, where it was very much an afterthought. You know, business intelligence, in a sense, is an afterthought. It’s like, after the events have happened, after the data has happened, you come and collect things and then make sense of it and say, okay, let’s get the numbers on an aggregate level and see what happened, how much we sold, how much inventory we have. It was purely, again, an aggregation, in a very mathematical sense, right? Now, in kind of the 10s or the teens, 2010 to 2020, that became much more real time, and then we saw ML, machine learning, and data science, right? So the need went to: okay, it’s not so much an afterthought, but what do we do now, as this data is coming in? How do we interact with the user in an intelligent way? And how do we use machine learning and data science to give them some perspective on what they’re doing and, you know, give them some pointers? And I think that’s where data engineering kind of grew alongside data science, right? In the modern world, a lot of these are real time. If I backtrack just a little bit before that, you know, we could say someone was a big data engineer, and again, even big data, to a point, was an afterthought. MapReduce and Spark were an afterthought; we just did it at a larger scale. Yes, other things came, like Spark and Kafka and these other technologies, that made that, again, more real time, and that was kind of the shift.
And now if we even step one level back from big data engineer, I think you hit it right on: in the Talend days, it was ETL developer, right? We were really in the data warehouse realm, where you were an ETL, extract, transform, load, developer. You worked alongside a business analyst very closely, and your job was pretty much just to provide the aggregate-level tables, to get from raw-level data to aggregate tables. And then the business analyst put a visualization, a dashboard, around that, and, you know, that went to the executives. And a step beyond that, like when we were initially hiring at Talend, yes, I would say you would be a software developer or a DBA, a database administrator, you know, someone who was just purely in charge of storing data. And I think that was the evolution. So I would say you went from software developer to ETL developer. From there, you went to big data engineer, and then big data engineer, somewhere along the way, morphed into data engineer and now, like, cloud engineer.
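
[Editor’s illustration] The ETL developer’s job Par describes, getting from raw-level data to aggregate tables for a business analyst, can be shown with a single GROUP BY. This is a hedged sketch, not anything from Talend or the episode: the table names `raw_sales` and `daily_sales`, the columns, and the sample rows are all hypothetical.

```python
import sqlite3


def build_daily_sales(conn):
    """Rebuild an aggregate table from raw rows and return its contents."""
    conn.execute("DROP TABLE IF EXISTS daily_sales")
    conn.execute(
        """
        CREATE TABLE daily_sales AS
        SELECT sale_date,
               SUM(amount) AS total_amount,
               COUNT(*)    AS order_count
        FROM raw_sales
        GROUP BY sale_date
        """
    )
    return conn.execute(
        "SELECT sale_date, total_amount, order_count "
        "FROM daily_sales ORDER BY sale_date"
    ).fetchall()


# Hypothetical raw-level data: one row per sale.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_sales (sale_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO raw_sales VALUES (?, ?)",
    [("2022-02-14", 10.0), ("2022-02-14", 5.0), ("2022-02-15", 20.0)],
)

daily = build_daily_sales(conn)
for row in daily:
    print(row)
```

A dashboard would then read from `daily_sales` rather than from the raw rows, which is the hand-off to the business analyst that Par mentions.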

Kostas Pardalis 31:43
Makes a lot of sense. Yeah, well, actually, I think it’s a very, very good timeline of what was there before and how it grew into the role of the data engineer today; I totally agree with you. So I have a little bit of a provocative question for you, based also on the personal experience that you have of moving through different roles, from hardware to software and then getting into data engineering. Let’s say I’m a software engineer, right? I write backend code, I don’t know, something like that. So yeah, a little bit more coding. How do I know, or what indications might I have, that I’ll be happy as a data engineer, and that it’s worth it for me to invest and transition from being a backend engineer into a data engineer?

Parham Parvizi 32:33
Another great question. You nailed it there. There are some personality characteristics of a data engineer, I think, that might not be a good fit for everyone. But I would tell them, financially, we can all agree that there is evidence that this is the fastest growing field in tech. It’s not going away, because data is not going away. It’s not hype, right? We’ve all been through a lot of hype in our careers, but at the end of it is data, and data is not going away. That’s for sure. There’s evidence, again, in the DICE report that came out that ranked data engineering the number one growing field, and there’s the evidence of the average salary for data engineers, which is higher than any other field in tech. So of course there are financial benefits to being a data engineer. And you and I can both agree that we can both pick up the phone and tomorrow we have a job, right? Just saying, with our skills. So we’re very privileged in that sense. Now, the personal characteristics of a data engineer, which kind of hurts me to say: first of all, we’re background people, right? Nobody hears our name. They hear our name mostly when things go wrong. If all the data gets where it needs to go and everything’s fine, nobody comes knocking on your door. But as soon as people don’t get their notifications, people don’t see their emails, all these apps fail, or there’s a data breach or something like that, then everybody’s going to know your name. They’re going to knock on your door. So, in a sense, I like to say that us data engineers are kind of silent heroes. We’re in the background, and nobody hears our name. We’re kind of the Q, right, in the Bond sense? I hate to use that analogy, but that’s somewhat true, from working with data engineers.
One thing is that in our field, the devil’s in the details, right? As a data engineer, you have to make sure that you’ve accounted for every place where the software could go wrong, that you’ve accounted for all the corner cases. In that sense, a lot of data engineers are very particular about attention to detail. I want to say we’re almost OCD, right? If you see my own apartment, it’s very neat. I am very OCD with data. But that’s what makes a good data engineer, the attention to detail, right? So if you have those characteristics, then data engineering is something good for you. If you’re someone who thinks in steps, if you played with too many Legos as a child, building pieces together. I did; I believe I still did at 13. But that helps, because in data engineering you have all these pieces, and you have to figure out how to put them together and build something bigger from that, right? So those are some good factors. I would say the one thing that I want to debunk: if you’re out there on the internet, a lot of people say that data engineering is too complex to get into, it’s too hard to learn. I absolutely disagree with that. I see where that is coming from, because people say, “Oh, you have to learn Spark, and Spark is this hugely massive distributed processing engine,” or you have to learn things like Kafka, these very complex software engineering concepts like distributed processing, right? But those things have been made so simple now. Spark is Spark because a lot of smart people worked for years to abstract away all of that complexity and make it something very simple to understand and very simple to use.
And I would say absolutely, anybody can go spend two weeks and be a very solid Spark developer, right? They can understand the concept of DataFrames, they can use it to aggregate data, they can use it to process data. So that’s the one thing that I want to completely debunk here: if you’ve heard that data engineering is too complex, that is untrue. It’s mostly a little bit of the gatekeeping talk that there has been in a lot of tech, I would say.

Eric Dodds 37:05
Kostas and Par, question for you here, and this might be provocative as well. Some software engineers who are listening may not like this, but one thing that’s interesting when you think about software engineering is you’re building something for an end user, right? So you want to develop empathy for the end user. So if you think about a software engineer, going back to Lyft, as opposed to a data engineer, the software engineer, in an ideal world, is trying to build empathy with someone who’s trying to book a ride, right? What’s happening that creates friction there? And I want to have empathy as I build this, right? What’s interesting about data engineering is your end user is someone in the business. So, to your question, Kostas, and I’d love your take on this as well: what if you’re a software engineer who has an interest in the mechanics of the business itself, maybe even beyond the experience of the end user? That’s kind of a false dichotomy, right, because they’re inherently related. But you said earlier, Par, you have to understand the business and the way that it works and generates revenue and all that sort of stuff. Would you say that a predisposition to being curious about the business is a good prerequisite for being a data engineer? Does that matter?

Parham Parvizi 38:42
I wouldn’t say it’s a prereq. It’s something that I picked up along the way, and I think you kind of pick it up by necessity along the way. But yes, it is essential that you have your eyes and ears open, listening for the ways that your process is making other people’s work easier. Just like you said, a software developer goes to a company to build empathy around the user experience. Our job is to build empathy for that software engineer, to make all the pieces they need so that they can do their job, the platform that they need. And then to get the data, hand it to the data scientist, and then go back to the software engineer and tell them, “Hey, these are the ways that your software is being used. Maybe these are some of the things that you haven’t seen,” and make that loop possible. So yes, you have to know the business, because you are the centerpiece. You are the piece that moves data around the company, so you’re going to touch all sides of the business. And it’s very important when you are in those meetings with the different business stakeholders that you really listen. I think a big part of a data engineer’s job is to listen, to be very honest. And then you take those things that you’ve heard back with you and turn them into requirements for what you have to do in your job. That tells you how you should store the data in a way that matches those business needs, right? And so, to close that up on empathy: as a data engineer, I feel like we live for the process, right?
Yes, maybe it’s not that glorified app that we designed, where we made it so much simpler for the person to click the buttons and get that ride quicker. But we made that process possible. We made all those pieces work, even though we weren’t front-facing on that. We’re in the back, again, connecting the dots.

Kostas Pardalis 41:14
Yeah, I agree with Par. Let’s say we were interviewing for the engineer role, right? I don’t think that I would pay that much attention to how interested the person is in the business itself at that point. I think that’s relevant for everyone, in the end, who works in the business. If you ask me, the first thing that came to my mind when I was listening to you, Eric, is data analytics. If we’re talking about data analysis, being curious about the business is important, because you have to understand the business to go and do data analysis, right? You have to ask, does this make sense or doesn’t it make sense, so you can go and figure out if something went wrong when working with the data. But for a data engineer, I don’t know. As Par said, we are talking about people who are in the background, and that’s part of how it is, and it’s a good thing. I mean, it’s not bad, right? There’s nothing wrong with that. But you learn about the business anyway. You cannot not learn it, let’s say that. And that’s relevant for every engineering role, right? Go and speak with someone in operations, for example. They know a lot about the business, because depending on the business, they know if they have to be on call or not, right? I wouldn’t say that there’s some difference between the data engineer and other engineers there. The main difference that I would see is that the customer of the data engineer is internal, right? It’s the marketing team, the sales team, the DevOps team, or whatever. It’s not necessarily, let’s say, only the person who uses the Lyft app to call a ride and go somewhere. So that’s the main difference.
But let’s put it in a different way. Let’s say I have a person who is more curious about the business than about the data-related problems, for example, what data you have to work with and what kind of systems you have. If someone was more interested in the first thing and didn’t ask any questions about the second, I would be worried. Probably there’s something wrong with the career path of that person. If I have someone who’s really interested in the data-related problems and also makes the commercial connection with the business, that’s amazing. But if it doesn’t happen, that’s completely fine.

Eric Dodds 43:58
Yeah, I think that’s a super helpful perspective. Okay, we’re closing in on the end here, so we have time for one more topic. One thing we’d like to chat about is something you mentioned when prepping for the show, Par: the modern data stack, which is something that we’ve discussed. I’d love your take on two things, and I’m sure Kostas will have some questions as well. You mentioned that you have pretty strong opinions about the five tools a data engineer needs to know, so I’d love to get that list from you, because we always love an opinionated take on the tooling. Then the other thing I would like to know is, if we think back to 2005 at Talend and then Hadoop and Spark, you’ve seen the modernization of the data stack. There are lots of definitions, and the jargon isn’t nailed down, but I’d just love your take on what the modern data stack is and what it means to you in the context of what you’ve seen over the past decade and a half.

Parham Parvizi 45:01
Yeah, great question. If I had to really summarize it, I would say the modern data stack now is the cloud. It’s on the cloud, and I’ll explain that in a second. Let me get into the five tools you really need to learn. These are backed by a lot of data, by that DICE report that I mentioned to you; these are some of the skills that were top listed there. In data engineering, you’ve got to learn the basics, right? This is, I would say, number one: the basics. And I would categorize that as your basic Bash and terminal programming, Python, and SQL. I chose Python as my language. I know there are a lot of languages out there for data engineers, but Python by far is the most dominant one. And it’s that bridge between data scientists and data engineers. So it’s a great language, again, built for data engineering, right? Number two, I would say, is Docker and Kubernetes. That’s becoming essential, especially for designing cloud-agnostic data pipelines, pipelines that work across clouds, and for containerization. Everybody is now designing serverless, microservice, these sorts of architectures, which are part of the modern data architecture, and Kubernetes and Docker are at the heart of that. I think Kubernetes actually was the number one skill for data engineering in that DICE report. Third is something traditional that’s been around, your number one big data tool: Spark. If you’re batch processing billions of records, Spark is still the tool to go to. And again, it’s very easy to learn Spark nowadays, because there’s great documentation and a lot of good resources out there. For number four, I’m going to move a little bit to stream processing and real-time processing.
And this is a tool that is the foundation for connecting almost all the data pipelines in a real-time sense, which is Kafka, right? Apache Kafka. And number five, I would go a little bit to the orchestration side. There’s a tool now that’s becoming heavily dominant on the orchestration side, because as data engineers, part of our job is to connect the pieces together and orchestrate data flows for data pipelines. And that’s Apache Airflow. Apache Airflow is now almost becoming the number one orchestration tool. But I want to come back and say, as a modern data engineer, what I’ve seen in the industry is that almost every project I do nowadays starts on the cloud, and everything’s moving to the cloud. So you need to learn these skills on the cloud. The first two are agnostic across clouds, right? Bash, Python, SQL, Docker, Kubernetes: the cloud vendors don’t even have a different name for a Kubernetes service, they call it Kubernetes. But for the last three, each cloud vendor has their own version, and they call it something else, right? Spark on Amazon is called Amazon EMR, on Google it’s called Dataproc, and on Azure it’s Databricks, the company that was behind Spark. Kafka on Google is called Pub/Sub, on Azure it’s Event Hubs, on Amazon it’s Kinesis. Airflow has different names, et cetera. And I say that because even in my own job, we wouldn’t even take jobs that are not on the cloud anymore, because they take so much longer to set up and maintain. That means we deliver a lot later, even as consultants, and that’s a bad thing. In consulting, you want to go as fast as you can, and the cloud vendors have just made that so easy.
They removed all the complexity of managing these systems, all the complexity of scalability. There is a huge move towards this serverless, event-driven architecture, like the use of cloud functions. That is huge on the cloud. Cloud functions are event triggered; they’re just a small piece of code that you write that could do a data transform, that could do nearly everything. And the cloud itself completely takes care of the scalability, the fault tolerance, all of those topics that we used to have to worry about. Immediately, your function can scale to millions of data points and can be triggered in real time to act on data. And they’re so easy to deploy. With one terminal command, one Bash command line, you can deploy your code from your machine to the cloud, where it is scalable to millions and millions of instances. That is so powerful these days. And if I take a step back, I think in a modern data stack there is a distinction between solutions and products, and we have to be very careful with that. There are a lot of different products out there, but there are very few solutions. If you think of a building-a-house sort of analogy, a product is a nail gun and a solution is a framed house, right? So don’t get bogged down so much in the products, I would say. The clouds are a kind of solution now, because they give you all of those products. They give you that framed house, almost.
And of course there are other solutions out there, something like Snowflake or Databricks. Those are now solutions; they provide everything you need, they’re not a product. There are a lot of products out there, products around different machine learning libraries, different audio processing libraries, different video processing libraries, code security, scanning code for faults. Those are, again, products. And in the end, those products, I think, will move to the cloud at some point in the near future, right? All of those products, all of those technologies, if you’re on their board, they’re kind of like, “Okay, what do we do to maybe sell to one of these cloud vendors at the end of the day?” That’s the exit strategy, right? So coming back to solutions, I think what’s very important in the modern data stack is which companies are providing that framed house, right? That’s very important for us to look at. Then, as far as being a data engineer, just going back to those: Python, SQL, Kubernetes, Docker, Spark, Kafka, Airflow. I know that’s more than five; I kind of count Python and SQL as one. But if you learn those, and learn those on the cloud, I can guarantee you that you will have a job in this industry. That gives you the base, and you can learn everything else from there.
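
Editor’s sketch: to make the event-triggered function Par describes a little more concrete, here is a minimal Python example of a small piece of code that transforms one incoming record. It is an illustration under stated assumptions, not any cloud vendor’s API: the handler name `handle_event`, the shape of the event, and the field names are all hypothetical.

```python
import json
from typing import Any


def handle_event(event: dict[str, Any]) -> dict[str, Any]:
    """Transform one incoming record.

    Hypothetical handler: real cloud function signatures differ by vendor.
    The event is assumed to carry a JSON string under the "data" key.
    """
    payload = json.loads(event["data"])
    return {
        "user_id": payload["user_id"],
        # Convert cents to dollars so downstream tables share one unit.
        "amount_usd": round(payload["amount_cents"] / 100, 2),
        # Normalize the event type; fall back to "unknown" if it is missing.
        "event_type": payload.get("type", "unknown").lower(),
    }


if __name__ == "__main__":
    sample = {"data": json.dumps({"user_id": 7, "amount_cents": 1250, "type": "RIDE"})}
    print(handle_event(sample))
```

The function holds no state and handles one event at a time, which is the property that lets a platform run many copies in parallel; the deployment command and trigger wiring are vendor-specific and deliberately left out.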

Kostas Pardalis 52:33
This is great, Par. One last question, and I think we can conclude this conversation today, although we have to have at least another one. I think there are many things that we can discuss, and there’s value in doing that. So let’s say I want to get a taste of data engineering, right? I’m not a data engineer. Can you give me a sample small project that would be a good way for me to get a taste of what it feels like to be a data engineer, something that I could potentially also put on my GitHub and demonstrate?

Parham Parvizi 53:15
Yeah, well, I would highly recommend going on Kaggle, right? Kaggle is a great data science and data engineering tool. They have a lot of challenge projects, live challenges where you’re competing with other data engineers and data scientists. But you can also look at the historical projects that they’ve had and pick up on some of those. I think that’s a great source. There are tons and tons of examples and projects there, and projects that have data with them. And I know you can agree with this, Kostas: it’s very hard to get your hands on big real-world data sets. 100%, yeah. And there are some sites for that too. DataHub.io, I think, is a good site for that. There are a lot of governmental agencies. In our course, we actually took all the flight data. The FAA actually publishes all domestic flight data, like where these flights took off, where they were going, what airline. A lot of that is public data, and we use that. It’s a great volume of data, and we used it to build our course. Not to take this opportunity to promote our program, Data Stack Academy, but we do have a lot of these projects. We have a 10-chapter course, and the first two chapters are free. If you go on our site, datastack.academy, there’s a place to get started for free. We actually send you two chapters that have a lot of these projects and that data set, this FAA data set that we’re talking about, and you can get started. There are a lot of good projects there, and we actually force you to go on GitHub, so you have to develop those on GitHub, kind of by design.
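
Editor’s sketch: as a rough idea of what a first starter project on flight data might look like, the Python below counts flights per airline and finds the busiest origin airport. The rows are made up for illustration and the column names (`airline`, `origin`, `destination`) are assumptions, not the FAA’s actual schema or the Data Stack Academy course material.

```python
import csv
import io
from collections import Counter

# Made-up sample rows; the column names are assumptions, not a real schema.
SAMPLE_CSV = """airline,origin,destination
AA,PDX,SFO
AA,PDX,LAX
DL,SEA,ATL
AA,SFO,PDX
DL,ATL,SEA
"""


def flights_per_airline(csv_text: str) -> Counter:
    """Count how many flight rows each airline code has."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["airline"] for row in reader)


def busiest_origin(csv_text: str) -> str:
    """Return the origin airport code that appears in the most rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row["origin"] for row in reader)
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    print(flights_per_airline(SAMPLE_CSV))
    print(busiest_origin(SAMPLE_CSV))
```

Swapping the in-memory string for a downloaded public CSV file and committing the script to GitHub would turn this into the kind of small, demonstrable project discussed above.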

Kostas Pardalis 55:05
Yeah. That’s amazing. That’s amazing. Eric, anything else that you would like to add?

Eric Dodds 55:11
This has been a great show, Par. We really appreciate you taking the time. It’s been fun to just hammer on the definition of a good engineer and talk about some of the specifics of the role. I think that’s helpful both for people looking to get into it and for people who have been doing it for a long time and even running teams. So we appreciate the perspective.

Parham Parvizi 55:30
Thanks again. Thanks so much for having us. I’m a huge fan of the show and a longtime listener. You guys are doing amazing stuff. Please continue to do what you’re doing. We really appreciate you as listeners.

Eric Dodds 55:43
Well, thank you so much. We love that feedback. And if anyone’s listening, please give us feedback. You can do that on the form on our website at datastackshow.com. So thanks again, Par. Take care. All right, Kostas, here’s the question: is being a data engineer fun, or is it not fun?

Kostas Pardalis 56:02
I don’t know, I would say that it’s fun. I think that’s the outcome of this conversation that we had with Par. And what is really interesting is not so much the discussion about the technologies and all these things. It’s really interesting to hear from him about the personality traits that someone has as a data engineer. I found this super, super interesting. So it’s not for everyone, obviously, but there’s huge demand out there. So if anyone’s thinking about it, give it a try.

Eric Dodds 56:38
Yeah, for sure. I think one of the things that stuck with me, and this was at the very beginning of the episode, but Par mentioned it again in the middle of the episode, is that data engineering is so big. That’s a simple statement, but it’s really true. One thing that I recalled from the early part of the conversation where he said that was when he was talking about the early days at Talend, and how it was so cool that they output Java code that you could use to customize that last 30% of the pipeline that you’re building. I thought, I bet we have listeners who might not be familiar with Talend, because they’re young and early in their career, and they’re just using a completely different set of tools. And we probably have other listeners who remember when Talend implemented the Hadoop and Spark componentry, and that was game changing. And I just thought, man, data engineering is big. It now spans multiple decades, and it’s just fascinating. It’s really fun to be able to talk about things that hit on both the history and the modern stuff that we use. So that was my takeaway. I appreciated the history lesson, if you will.

Kostas Pardalis 57:54
Yeah, absolutely, and hopefully we will have him back in the future. We have many more topics to cover with him.

Eric Dodds 58:00
I know, we need to actually have Brooks start bringing some people back on, because we always say we’re going to do that, and then we get busy and we don’t do that. So that’s our New Year’s resolution for The Data Stack Show. All right, well, of course, subscribe if you haven’t, so you can get notified of new episodes, and we’ll catch you on the next one. We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.