Episode 141:

A Journey From Backend Engineer to Data Engineer with Ioannis Foukarakis of Mattermost

June 7, 2023

This week on The Data Stack Show, Eric and Kostas chat with Ioannis Foukarakis, Senior Data Engineer at Mattermost and recent winner of RudderStack's transformations challenge. During the episode, Ioannis discusses data engineering, ML, and the influence of software engineering on data. Topics also include stories from Ioannis and Kostas' college days, how data and engineering have changed over the years, future developments and exciting opportunities in the space, and more.

Notes:

Highlights from this week’s conversation include:

  • Ioannis’ background and journey in data (2:42)
  • RudderStack's transformations feature and examples of its application (4:20)
  • Winning the transformations contest at RudderStack (7:21)
  • How Ioannis’ transformation project works for data governance (9:40)
  • Memories from college for Ioannis and Kostas (12:30)
  • Getting into the world of software development (17:27)
  • The changes in data and engineering over the years (20:29)
  • Bridging Java with Python (23:15)
  • Dealing with ML workloads in the past vs. workflows of today (26:30)
  • Data engineers and ML engineers (33:12)
  • Dealing with data in the early stages to ensure reliability later on (38:39)
  • What creates problems with data quality? (42:11)
  • Exciting developments in data engineering (46:48)
  • Final thoughts and takeaways (51:12)

 

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 00:03
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com. Welcome back to The Data Stack Show, Kostas. Fun episode. So RudderStack, the company that helps us put on the show, recently ran a competition around transforming data, and we’re going to talk to the winner of that competition. His name is Ioannis, and he works at a company called Mattermost. But you actually know Yanni from your days at the university, so I have a feeling this is going to be an extremely fun conversation. I’m going to ask the obvious question: what did he build for this competition? Little preview: it’s a pretty cool data governance flavored feature that relies on the concept of data contracts, but it kind of runs in transit in the pipeline. So a pretty interesting approach. I want to dig into that with him, because I think it was a pretty creative effort. But you obviously know a lot about Yanni. So what are you going to ask?

Kostas Pardalis 01:19
Yeah, I think it would be great to go through his journey, because, again, he, just like me, has been around for a while. And he has had an interesting journey, from graduating, to doing a PhD, to going into the industry, doing backend engineering, to ML engineering, to data engineering. So I think he has a lot to share about this journey and the way the industry has evolved. And then I think it would be great also to spend some time with him and learn from his experience about data engineering and ML engineering, the boundaries between the two, and what it takes to make sure that both functions operate correctly. So let’s do that and start with him. And I’m sure there are going to be some fun moments remembering the past.

Eric Dodds 02:21
Let’s do it. Yanni, welcome to The Data Stack Show. And congratulations on winning the RudderStack transformations challenge. It was really cool to see all the submissions. And you won.

Ioannis Foukarakis 02:34
Thank you, and thanks for having me. It’s great to talk with all of you. Thank you for your kind words about the submission. I think I was pretty lucky because there were a lot of great submissions.

Eric Dodds 02:49
Cool. We’ll talk about that challenge, and we want to hear what you built, because it actually relates to data quality, data contracts, data governance, lots of topics that we’ve covered on the show that are super relevant. But first, give us your background. You actually have a connection to Kostas in your past, which we want to dig into a little bit later. But yeah, give us your background and tell us what you do for work today.

Ioannis Foukarakis 03:16
Yeah, so I’m a data engineer at Mattermost. I received my PhD in electrical engineering a few years ago. After receiving the PhD, I started working as an adjunct lecturer, teaching object-oriented programming with Java, database systems, and software engineering. Then I moved to the industry, initially as a Java backend engineer and later as a machine learning engineer. But you know, these things are kind of connected, and I gradually moved to my latest field, which is data engineering.

Eric Dodds 04:00
Love it. And just give us a quick overview of your work at Mattermost. What does Mattermost do? Just give us a quick overview.

Ioannis Foukarakis 04:07
So Mattermost provides secure collaboration for organizations like governments, banks, and tech giants, so everybody can accelerate productivity and reduce error rates while meeting nation-state-level security and compliance requirements. We have a lot of customers, ranging from the US Air Force to Bank of America and Tesla. Good companies.

Eric Dodds 04:38
Wow. Incredible. Okay, well, let’s talk about the transformations challenge really quickly. So RudderStack, and of course I work for RudderStack so I’m familiar with this, but we want to hear it in your words. Our customers love our transformations feature. First of all, explain transformations to us. What is RudderStack’s transformations feature? And maybe, what are some of the ways that you use it at Mattermost?

Ioannis Foukarakis 05:04
So, a transformation is a way to modify incoming events or filter events before they reach your final destination. So as soon as the client fires an event and the event is detected by RudderStack, RudderStack runs this transformation, the transformation performs its logic, and then stores the result. So you can think of it as changing the order in which the transformation happens during loading. So it’s up to you to decide which transformations you do after the data is loaded into the database and which ones before.

Eric Dodds 05:50
And what are some of the ways that you use transformations at Mattermost? You know, because you stream data from multiple sources, you know, iOS, Android, web, etc.

Ioannis Foukarakis 06:02
Yeah, exactly. So we don’t currently use transformations, we are investigating them. We have a lot of data coming in from clients, and we were thinking about modifying the organization of the data and how events are stored in the data warehouse. So one of the things that we were thinking about is whether we can filter some events that come in as noise. But there are also some bugs that might happen, you know, these bugs might exist in servers that run an older version of the code. And you know, you can’t wait for the customer, you can’t force the customer to upgrade something that’s installed on-prem. So we can use a transformation in order to reconcile the issues these bugs might introduce.

Eric Dodds 06:56
Oh, interesting, right? It’s just like someone’s running an older version of iOS. So they have like a previous version of your instrumentation. But when you update the instrumentation on newer versions, you need to fix the payload to sort of align with the new schema.

Ioannis Foukarakis 07:11
Yeah, that’s one way. The other way is that Mattermost has a server component that you can install on-prem, and the maintenance of these components, these are things that might be outside of our control. But the data that we receive from them, that we can modify using transformations.

Eric Dodds 07:34
Got it. Yeah. Okay. So yeah, someone is running it on-prem, but you need to modify the data so it sort of lines up across customer analytics. Super interesting. Okay. Well, tell us about the transformation you built. What was the original problem you were thinking about when, you know, you saw the competition and wanted to build something?

Ioannis Foukarakis 07:53
It started with me having the idea in mind, and I was planning to experiment with it. And the challenge is what pushed me to actually go and implement it. So the idea is that when you receive events from various sources, from various teams in the company, you have to agree on the payload so that the data engineers know what to expect: what are the expected fields, the properties, and eventually the columns in the tables, and so on. There are various ways to try to enforce these contracts that are agreed between the product teams and the data engineering team. And one option is to have these contracts in the form of version-controlled files, like a schema. And the transformation is checking whether the events are adhering to the schemas that you have specified so far.

Eric Dodds 08:56
Yep. So you have an event coming in, and let’s say one of the challenges is either a versioning challenge, like we talked about before, where someone’s running an old version of the app, so the schema is different and you need to modify it, that could be one way that it doesn’t align. Or the developers implement something that maybe isn’t quite accurate, or they change something. And so, you know, as a data engineering team, it’s a way for you to flag that in transit to make sure that nothing breaks downstream.

Ioannis Foukarakis 09:28
Yeah, exactly. So let’s say that you have agreed on a contract, and you agree that the properties are going to be A, B, and C, but then for some reason, some miscommunication between different teams, somebody goes ahead and adds an additional property. So by checking the schema, and depending on how strict we want to be, we can either discard the event or we can trigger a notification that we noticed an event with a different schema and we need to take action.

Eric Dodds 10:01
All right, well, give us just a brief overview of how this works in RudderStack transformations. Like, how did you wire it up?

Ioannis Foukarakis 10:09
So I used the JavaScript part of the transformations. It’s great that RudderStack supports both JavaScript and Python in transformations. I went for the JavaScript part because, to be honest, I wasn’t that confident there, I’m more familiar with Python, so I wanted to try these kinds of things. So I wanted to focus both on offering a solution, but also on investigating how you can apply good engineering practices when writing transformations. So there’s a repository that the public can follow, the link is in the submission. The code there uses a library for parsing JSON schemas. And in the transformation, what you do is you define the schemas, you map the schemas to the event names, and then the transformation checks, for each event, which schema corresponds to this event, does the validation, and you can decide whether to discard the event or just log an error message that there’s something off. There’s also some additional code about testing the submission, how to set up test events, how to set up CI/CD for the transformation, and some more practices like that.
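
To make the approach concrete, here is a minimal sketch of the same idea in Python using the open-source jsonschema package. Ioannis' actual submission used RudderStack's JavaScript transformations and its own schema library, so the event name, schema, and the transform_event entry point below are illustrative assumptions, not his code.

```python
# Hypothetical sketch of schema-based data contracts enforced in transit.
from jsonschema import validate, ValidationError

# Version-controlled "data contracts": one JSON Schema per event name (made up).
EVENT_SCHEMAS = {
    "product_viewed": {
        "type": "object",
        "properties": {
            "product_id": {"type": "string"},
            "price": {"type": "number"},
        },
        "required": ["product_id", "price"],
        "additionalProperties": False,  # flag unexpected properties
    },
}

DISCARD_ON_VIOLATION = False  # or True, depending on how strict you want to be


def transform_event(event: dict) -> dict | None:
    """Validate an event's properties against the schema agreed for its name."""
    schema = EVENT_SCHEMAS.get(event.get("event"))
    if schema is None:
        return event  # no contract defined for this event; let it pass through

    try:
        validate(instance=event.get("properties", {}), schema=schema)
    except ValidationError as err:
        # Either drop the event or keep it and surface the contract violation.
        if DISCARD_ON_VIOLATION:
            return None
        print(f"Schema violation for '{event.get('event')}': {err.message}")
    return event
```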

Eric Dodds 11:43
Love it. And we’ll make sure to include that in the show notes. Very cool. What a creative way to sort of explore an implementation of data governance, with RudderStack transformations. I love it. What was the most enjoyable part of building the transformation that you built for the competition?

Ioannis Foukarakis 12:07
Seeing it work? Yeah, definitely. But I think it went really smoothly. I didn’t spend a lot of time writing the code, so I really liked that it was a really fast prototype. And then I feel like I spent more time setting up the project structure rather than actually writing the transformation. So that was really nice. And the user interface of RudderStack for testing the transformation made it easy to see how it behaves.

Eric Dodds 12:45
Great. Well, congrats again, super cool project. Okay, I want to ask you another question, because your background is really interesting. So you studied electrical engineering, then you got into sort of software development, specifically backend development, and then you got into data engineering. That’s a super interesting story. But at the beginning, you were in school with Kostas. And so I want to hear maybe your best and worst memory of Kostas when you were in school with him.

Ioannis Foukarakis 13:20
I think it’s the same thing. I think me and Kostas, we were going to the same lab in order to get free internet.

Eric Dodds 13:30
Free internet in the lab.

Ioannis Foukarakis 13:32
Okay, so I won’t say our age. But back when we had dial-up modems, we didn’t have any internet available at home. So we used to go to specific labs, or run some errands or offer some assistance to the lab, so that we could get access, go to the lab, and be able to stay there and code or search the internet back then. Or to use IRC.

Eric Dodds 14:07
Yeah, IRC. That’s good. Is that where you met?

Ioannis Foukarakis 14:11
Kostas was in the same semester, yeah. We were in the same class.

Kostas Pardalis 14:21
I mean, it’s been a while. Yeah, yeah, we don’t want to disclose our ages, but it’s been a while. So.

Eric Dodds 14:31
Okay, Kostas, what do you have to ask?

Kostas Pardalis 14:33
Back then, way back then at the university, there were like a couple of very specific spots where you would meet with people, right? One was the coffee shop at the school, where you would end up meeting people and drinking coffee. And then there were the labs, where we would do something like what Yanni was describing, because back then having access to a fast internet connection was pretty much non-existent in Greece. So, okay, that was one of the benefits of being in the School of Electrical and Computer Engineering in Greece, like having access to a very fat pipe for that time, right?

Ioannis Foukarakis 15:22
Yeah, it was one of the main reasons I got my PhD.

Eric Dodds 15:29
That’s great. Okay, I do have to ask, though. Surely at night, you weren’t just working on schoolwork? Like, of course, you played games in the lab, right, with other people from school, with the internet connection?

Kostas Pardalis 15:46
Yeah, so. Okay, so now you’re getting into the interesting parts. The problem is that the more questions you ask, the easier it’s going to be for people to figure out our age, because...

Eric Dodds 16:02
I didn’t mention any names of games, I’m just saying, you know, based on my own experience, like... Yeah.

Kostas Pardalis 16:09
But okay, we have to talk about it at the end. There were, I think, two main things. One was the Quake Arena server that we had at the university, and I think that was a box that was hosted in the CS lab, if I remember correctly.

Ioannis Foukarakis 16:29
Yeah. I don’t remember where it was, was it...?

Kostas Pardalis 16:36
I don’t remember whose it was either, there was a guy who was hosting his own, like, personal server anyway. And then there were a lot of people getting together, I think in the software lab, and playing StarCraft.

Ioannis Foukarakis 16:51
Yes, I think something like that. But I mean, one of the funniest memories I have is with Quake. We were attending a class lecture, and everybody logged into the server. We used the names of the professors as nicknames. And it was funny because it was, you know, those old CRT screens, and whenever the professor who was teaching started walking towards the back, you could hear the Alt-Tab clicks, a wave of screens switching because of him coming towards the back. It’s one of the funniest memories I have.

Eric Dodds 17:33
Unbelievable. I love it. I love that, that sounds like... you know, I love that this is happening in the context of a PhD. That’s just so great.

Ioannis Foukarakis 17:44
Ah, this was before the PhD.

Eric Dodds 17:46
Okay. Yes, yeah. Okay, so electrical engineering, Quake Arena... Yanni, why did you get into the world of software?

Ioannis Foukarakis 17:59
I went to school because of software. I liked computers since I was young, so I got to study something related to software. All the pieces fell into place when I started to code. So even though I was in the electrical engineering department, practically, because it was electrical and computer engineering, I focused mostly on the software part, because I liked it the most. I tried a bit of academia. It was, you know, something like the next step to try after graduating, for a variety of reasons. But I always wanted to, you know, I didn’t want to only teach, I also wanted to write code. And partly because of the economic crisis back in Greece, at some point I moved completely to the industry, and I’ve been enjoying it since then.

Kostas Pardalis 19:05
Yeah, and we need to clarify something here: the school we attended was the School of Electrical and Computer Engineering. The two were never separated, at least in the Technical University we went to. So if you wanted to go and study computer engineering, you had to torture yourself with electrical engineering for a while, together with a couple of other things. And to be honest here, although at the beginning I didn’t enjoy that we had all this variety of different stuff to learn and go through, in the end it was a very interesting experience to learn all these different things and have, let’s say, a more complete engineering training, ranging from classical electrical engineering, to things like communications, to electronics, to software, to game theory.

Ioannis Foukarakis 20:07
Yeah, a bit theoretical at moments.

Kostas Pardalis 20:12
yeah, it was pretty theoretical. But anyway, it was good. At the end. We suffered a little bit, but at the end, I think, like, paid off. So yummy. Okay, let’s talk a little bit about this journey, right? Because, okay, like, we’ve been around, like, for a while we like software and the industry was like, obviously, like completely different. But then when we graduated, or even when we ended, like, the school today, as you said, like, you have like the title of like the data engineer, let’s talk about the general a little bit in like your experience, right, like how you have experienced, like the change in the industry. And let’s focus on some things that you, at least, like from your perspective, like you find, like, interesting to share, and maybe surprising also.

Ioannis Foukarakis 21:02
So when I started, as I said, I started as a backend engineer. Java was the hot thing back then. It was relatively slow compared to other programming languages, but it was building up at the moment, and there was a great community back in Greece at the time. I got to really like it. We were picking up, you know, the early days of Spring, and it moved me away from servlets and all of that stuff whose names I’ve forgotten. Then I had an opportunity to start working remotely, it was around 2012, something like that. And I started working for a data science team, initially as a Java backend engineer who was responsible for integrating machine learning algorithms with the rest of the systems. So the interesting thing there was that it was the first time I started engaging with machine learning and data science. It was still, you know, kind of the early days of this revolution, this innovation. The feeling I had when I left university was that there are things related to machine learning, data science, and so on, but it was a bit romantic, it wasn’t easy to apply them in the industry while we were studying. But then things started changing. We started with scikit-learn, NumPy, and all these tools, and it was a really interesting time because it wasn’t as easy as it is now. So in order to run scikit-learn, you had to compile the whole thing from scratch. It was challenging even to get the things we’re talking about, it was a 0.something version. But what I really liked and what really surprised me back then was how, if you have a business objective and the proper data, and you store the data, you can use algorithms to make estimates and predictions, to help improve things and optimize your objective. And this was really interesting to see happening.

Kostas Pardalis 23:39
Yeah, that’s cool, by the way.

Ioannis Foukarakis 23:41
I mean, we have,

Kostas Pardalis 23:43
Like, let’s say, traditionally, when we’re talking about something like ML and data science, we always have Python in our mind, right? That’s, let’s say, the most common language used in the ecosystem. But you mentioned that you were doing backend stuff in Java, right? So, how did this work? Like, how did you bridge Java with Python?

Ioannis Foukarakis 24:09
So, initially, we started implementing some of the algorithms in Java back then. It was basically simpler algorithms, like Apriori, or FP-Growth, or k-means. But then at some point, if you needed to work with logistic regression or some other things, then you needed to work with Python, because there were a lot of libraries. So there was a layer of integration that was responsible for gathering the data and sending them to an endpoint. So the Java part was gathering the data, doing all the aggregation and preparation, and then sending them to the Python code.
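
As a rough illustration of the integration layer described here, the sketch below shows what the Python side might look like: a small HTTP service that receives features the Java backend has already gathered and aggregated, and returns a prediction. Flask, the /predict route, the toy model, and the feature shapes are assumptions for illustration, not the actual setup Ioannis describes.

```python
# Hypothetical Python endpoint that the Java integration layer could POST to.
from flask import Flask, jsonify, request
from sklearn.linear_model import LogisticRegression
import numpy as np

app = Flask(__name__)

# Pretend this model was trained and persisted elsewhere in the pipeline.
model = LogisticRegression()
model.fit(np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]),
          np.array([0, 1, 1, 0]))


@app.route("/predict", methods=["POST"])
def predict():
    # The Java layer sends already-aggregated features as JSON.
    payload = request.get_json()
    features = np.array([payload["features"]])
    return jsonify({"prediction": int(model.predict(features)[0])})


if __name__ == "__main__":
    app.run(port=5000)
```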

Kostas Pardalis 25:01
Okay, so Java was doing, let’s say, more of the loading and data preparation part of it?

Ioannis Foukarakis 25:07
Pretty much, yeah. But this evolved at that company. So we actually built some tooling that allowed us to have models that we could deploy and allowed the two parts to work asynchronously, running independently. So you could, you know, use this tool to do the training of models and to keep a log of your experiments, and then the Java code would only need to point to the proper model.

Kostas Pardalis 25:50
like ML ops work, but one ml ops version of the term, right?

Ioannis Foukarakis 25:54
Yes, exactly, exactly. But it’s not only that, the other parts are also important. It’s about making sure that you have the data. So, for example, you might need the customer profile. You need daily snapshots, because it’s hard to go back historically every day and calculate the profile. And you also need to store these so that you have historical data, so that you can train your model without, you know, having recent data creeping in as past data, and all these kinds of problems that can get in the way in ML. So yeah, that was definitely also part of the work. I think it still is. It’s one of the most interesting parts.

Kostas Pardalis 26:48
Yeah, yeah, absolutely. So let’s talk a little bit more about that, because actually, it’s interesting. So you mentioned a few of the challenges that you had back then, like having these ML workloads. How did you deal with them back then, and how would you deal with them today? So we can see how these ten years have changed the way that we’re doing things in data engineering.

Ioannis Foukarakis 27:19
Yeah. So back then, it’s funny, because it looked like similar functionality. So one thing that we had back then is caching the data. We would be caching the data, storing them into a file system or S3, and then moving them to a data warehouse. And then we’d use SQL queries for doing the transformation, and the output of the transformation was the training data for the model. And something similar for the prediction, although you might need to call some APIs to get more recent data, because they might not yet be available in the data warehouse. So that was one thing. This changed over time, you know, we now have the tools we used to wish for, with the advent of cloud computing and all these nice tools. So it’s still pretty common, when you have data, to just dump them into an S3 bucket, for example, when you have them available, and then you decide what to do with them. But then you also need to load them somewhere to perform the transformation. So for the transformation part, you can either use something like Spark and the different options that you have out there, you can use SQL with something like Presto or Athena, or you can use a data warehouse and load the data to the warehouse. There are also a lot of organizations that chose Hadoop for all these things. And then it always depends on the use case. So in some cases, where you do some offline computation, you can just create a batch job that runs every night, let’s say, and calculates some results, and then you cache these results into a database so that it’s faster to query them. Or you might need streaming, so you might need to have a stream like Kafka or whatever, and for each item that comes out of this stream, perform a prediction. So it really depends on the use case and what you want to achieve. It’s like everything in software: you have to understand what your objective is, and then start working towards what are the best technologies to use.
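
The nightly-batch pattern described above can be sketched roughly like this: raw data dumped in S3, a SQL transformation over it, and the results cached in a database for fast querying. The bucket, key, and table names below are invented, and a production pipeline would more likely use a warehouse, Spark, or an orchestrator rather than a single script.

```python
# Toy sketch of a nightly batch: pull raw events, transform with SQL, cache results.
import sqlite3

import boto3
import duckdb

BUCKET, KEY = "example-raw-events", "events/2023-06-01.csv"  # hypothetical names

# 1. Pull the raw dump from the landing zone.
boto3.client("s3").download_file(BUCKET, KEY, "/tmp/events.csv")

# 2. Run the SQL transformation over the raw file.
daily_counts = duckdb.sql(
    """
    SELECT event_name, COUNT(*) AS event_count
    FROM read_csv_auto('/tmp/events.csv')
    GROUP BY event_name
    """
).fetchall()

# 3. Cache the results in a database that is cheap to query from the application.
with sqlite3.connect("/tmp/results.db") as con:
    con.execute("CREATE TABLE IF NOT EXISTS daily_counts (event_name TEXT, event_count INT)")
    con.execute("DELETE FROM daily_counts")  # overwrite last night's run
    con.executemany("INSERT INTO daily_counts VALUES (?, ?)", daily_counts)
```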

Kostas Pardalis 30:00
Yeah. So let’s say someone comes to you and is like, I’m considering getting into data engineering. They’re a software engineer, but they haven’t worked in data engineering before. And they ask you, okay, what are the most common use cases, right? Like, what are the most common things that you see as a data engineer? What’s the first thing that comes to your mind, let’s say the three or four most common use cases that pretty much every organization out there deals with when it comes to data engineering?

Ioannis Foukarakis 30:42
The first one is data collection. So you have various sources in an organization, and you want to load them into your systems, or at least store them in a temporary place that you can use as a starting point. And this can be either from databases or other systems, or it can be from user actions and events, and you might need this for product analytics and so on. Then the second part is transformation, in order to build some end results: you gather the data from various sources, and you want to combine them in order to build a story or to try to understand what’s happening. So this is another common case. You also definitely need, at some point, to send the data to some other systems used by the company, like Salesforce, or, I don’t know, HR systems or whatever, so kind of reverse ETL, so that it’s available to sales, to do this integration. There’s also the data science and machine learning part. So these are the most common things, I think. I might be forgetting something, but yeah.

Kostas Pardalis 32:06
Why is the data science and ML part different from, let’s say, the rest of the stuff that you’re doing with data?

Ioannis Foukarakis 32:14
So in ML, there’s a lot of exploration. It’s not like you have something solid to work on. ML is about optimizing things, most of the time. And actually, this is one of the most important things when working with data, especially with ML. The first thing you need to understand is: what are your business objectives, and what do you want to achieve? One of the most common reasons things might not go as planned is that there is no clear objective. So usually your objective is not to achieve a specific precision and recall, for example. Your objective is to improve sales, for example, or to improve lifetime value prediction, or to improve churn prediction, and so on. And then you use the models, because actually that’s why they are called models, because we’re trying to model the problem in order to provide an estimation, and so on. And precision and recall are proxy metrics that you can use to work towards your goal. So this is the most important thing to remember.

Kostas Pardalis 33:38
Yeah, and how does it work between the data engineer and the ML engineer, right? Because the data engineer, let’s say, is responsible for making sure that the data is available, that there are pipelines that prepare the data, and all that stuff. And then you have the ML engineer, for whom, as you very well said, it’s all about experimentation, right? It’s all about being scrappy in a way, there’s no order, right? You have to get in front of a bunch of data and try to do something. So how have you seen data engineers and ML engineers working together successfully? And if you have also seen some unsuccessful attempts, it would be great to hear about those too.

Ioannis Foukarakis 34:23
I think for the ML engineers, the most important part is to have ease of access to the data, and for the data to be easy to use. So usually, data scientists and machine learning engineers are fluent enough in SQL and other languages so that they can write some transformations in order to be able to use the data in the models. What might be challenging is the integration with other systems. Although, you know, it’s a blurry line where the border of ML engineering and data engineering is. So let’s say that your company’s architecture is a monolith, and you want to get the data in order to work with it. The ML engineers can’t go directly to the production database and use the data from there, because they might run heavy queries, which is really common. So they might need a replica, they might need to combine it with data that comes from a CDP or from something external. So they need to have enough freedom in order to be able to achieve their goals.

Kostas Pardalis 35:40
So how would you define the boundaries between data engineering and ML engineering, then? Like, where do you think these boundaries should be set?

Ioannis Foukarakis 35:53
It’s really hard to answer this one, I think. I mean, these things are continuously evolving over the past years, and, you know, quite often the title means something different from one company to another. So I think there’s a lot of overlap. I think that the data engineer is the person who is closer to the ingestion, to loading the data, to taking care of the data quality and all of these things. The ML engineer is responsible mostly for making sure that the data are in a good enough format so that the data science models can use them. But again, it’s a blurry line. There’s a lot of overlap there.

Kostas Pardalis 36:48
Yeah. Makes a lot of insulin, like, asking the question in a bit of a different way. So what is something that you have to do as part of, like, an ML task that you hate doing as a data engineer, like that you wouldn’t like to do like I, in an ideal world, you wouldn’t like to do that.

Ioannis Foukarakis 37:08
I love Cisco. So I’ve got all these catching trout. So I like challenges. And so I think, yeah, what most people will say is clean data in the expectation that a data engineer has clean data, but it really is happening. And one of the keys is cleaning the data is 80% of your time, or even more. That word, and that’s what was helping. I wouldn’t want us Nimal engineers to have to write ingestion pipelines for multiple sources. So for example, I would prefer that this solves problems when it comes to, you know, cleaning data. So that data gathering thing, in a way so that I can process them all together, I don’t have to build custom logic to Kylin everything.

Kostas Pardalis 38:01
you elaborate a little bit more on that, like you mentioned, like JSON, so like what’s like the, what’s like the, the cab part of like, let’s say the annoying parts of like dealing with that data.

Ioannis Foukarakis 38:17
Probably the formats. So if I am to say I don’t like something, it’s CSV. Because CSV is not a single standard format. It’s often treated as a single one, but you know, you need to define separators, escape characters, what you do with quote characters, special characters, and then you have all these peculiarities that some tools have. For example, Redshift has its own ideas about handling CSV, and stuff like that. So I don’t know if this is what you’re asking.
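
A tiny example of the point that CSV is not a single format: the same record can be serialized with different delimiters, quote characters, and escape rules, so the reader has to be told the dialect explicitly. The sample rows below are made up.

```python
# Two "CSV" files carrying the same record, written with different conventions.
import csv
import io

comma_quoted = 'user_id,comment\n42,"said ""hi"", then left"\n'
pipe_escaped = 'user_id|comment\n42|said \\"hi\\"\\, then left\n'

for raw, dialect_kwargs in [
    (comma_quoted, {"delimiter": ",", "quotechar": '"'}),
    (pipe_escaped, {"delimiter": "|", "escapechar": "\\", "quoting": csv.QUOTE_NONE}),
]:
    reader = csv.DictReader(io.StringIO(raw), **dialect_kwargs)
    for row in reader:
        # Both parse to the same comment once the right dialect is supplied.
        print(row["user_id"], row["comment"])
```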

Kostas Pardalis 39:00
Maybe actually, like, a very great topic. I have more questions here. So let’s go through like a little bit of the quality of the flow of the work there until the data gets to the ML engineer, right? So the data comes from various sources, and obviously, like in different formats, right? Different serialization. And even in the same, let’s say, serialization, you might have like different schemas, right? So it’s going back like, for example, the weight of the reason like what you submitted in one, like in the contract was about taking the schema of some events, right. So these first parts are like dealing with data, right, like you can have data coming like in Avro data coming in, I like protobuf, CSV JSON. And or like what else? How big is the part of work that you like? The ER has to deal with all these different formats and making sure that they don’t get into the way of whatever happens later on. Right?

Ioannis Foukarakis 40:09
Yeah. So you need to think about the layers, let’s say the zones of your data. You have to have something like a landing zone, where all this data lands on your system, that you need to start processing, doing nothing fancy if possible, to make sure that if something breaks, you either identify it fast enough or you raise an error. So, you know, if something breaks, you can figure it out as soon as possible. Again, luckily, nowadays it’s easy to ingest most of these formats, and most of them have pretty common ways of dealing with them. There is a need for you to know the specifics of each format, but I think the biggest problem is the representation of the data, not the format of the data, the representation. By this, I mean: how would you model something that is optional? Would you consider null a valid value, or something as a missing value? And let’s say that you have a JSON document: what does it mean that a property is missing in a specific row? Does it mean that it’s null, or that the user didn’t provide it? So this is a bit of the annoying part, because it requires a lot of back and forth with the source, and sometimes you don’t have access to the team that created this data. But yeah, so you definitely have this first layer to clean the data and have them, you know, in a form that’s pretty solid, not super solid, still flexible, that doesn’t stray far from the original source data. It does the basic typing, renaming, applies basic conventions, and so on.
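
Here is a small illustration of the representation problem described above: in JSON, a property can be absent, explicitly null, or present with a value, and each case may mean something different to the source team. The field names are hypothetical.

```python
# Distinguishing "missing" from "null" from "present" in incoming JSON events.
import json

events = [
    json.loads('{"user_id": 1, "plan": "pro"}'),   # value present
    json.loads('{"user_id": 2, "plan": null}'),    # explicitly set to null
    json.loads('{"user_id": 3}'),                  # property missing entirely
]

_MISSING = object()  # sentinel so we can tell "absent" apart from "null"

for event in events:
    plan = event.get("plan", _MISSING)
    if plan is _MISSING:
        status = "missing: field never sent, maybe an old client version"
    elif plan is None:
        status = "null: field sent, value unknown or cleared"
    else:
        status = f"value: {plan}"
    print(event["user_id"], status)
```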

Kostas Pardalis 42:20
So if we were to talk about data quality, what are the parameters of data quality? Okay, we talked about one of them, the semantics of how data is represented in different formats and all these things. What else creates problems with data quality?

Ioannis Foukarakis 42:42
That’s a great question, and I don’t have one answer. I think that each organization defines data quality in a different way. There are various dimensions of data quality that you can discuss, but you know, depending on your use cases and what you want to achieve, you might want to focus on some of them. So, you can think about consistency, like, you know, having multiple sources or rules for the same data: whether the sources are consistent, whether you have duplicate values, etc. You can think about completeness, whether you have missing data, which is also important. You can think about accuracy, how well the data represents reality, whether values are in the expected format, for example, if you have a date, you need to know that it is in the proper format so that it does not get misinterpreted. Data freshness is another one I can think of off the top of my mind, and so on. Two more that are sometimes overlooked: one is accessibility, so how easy it is to access the data. Does it take a long time for some members of the team to get access to the data, or do they have to wait for, I don’t know, either technical or business reasons? And the other is how easy it is to use the data. So if you just give someone an S3 bucket full of files, it might not be nice for them to use. But if you’ve organized it and have proper naming of the columns, etc., it would be way easier for them to work with. Again, there might be way more, it definitely depends on the use case. For example, if you’re working on open source datasets, some of these things might be more important than the rest, or you might want to also have versioning as part of the data quality. So there are definitely a lot of things to look at.
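
As a rough sketch of a few of these dimensions, the snippet below expresses completeness, consistency, and freshness as simple checks over a batch of records. The thresholds, field names, and sample events are invented; real teams would typically reach for dedicated data quality tooling rather than hand-rolled checks like these.

```python
# Toy checks for three data quality dimensions over a small batch of events.
from datetime import datetime, timedelta, timezone

events = [
    {"event_id": "a1", "user_id": "u1", "sent_at": "2023-06-01T10:00:00+00:00"},
    {"event_id": "a1", "user_id": "u1", "sent_at": "2023-06-01T10:00:00+00:00"},  # duplicate
    {"event_id": "a2", "user_id": None, "sent_at": "2023-06-01T10:05:00+00:00"},  # missing user
]

# Completeness: how many records are missing a required field?
missing_users = sum(1 for e in events if e.get("user_id") is None)

# Consistency: are there duplicate records for the same primary key?
duplicate_ids = len(events) - len({e["event_id"] for e in events})

# Freshness: how stale is the newest record we received?
newest = max(datetime.fromisoformat(e["sent_at"]) for e in events)
lag = datetime.now(timezone.utc) - newest

print(f"completeness: {missing_users} record(s) missing user_id")
print(f"consistency:  {duplicate_ids} duplicate event_id(s)")
print(f"freshness:    newest record is {lag} old (alert if > {timedelta(hours=24)})")
```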

Kostas Pardalis 45:12
Yeah. So, okay, you’re dealing with data quality pretty much on a daily basis, I guess. What do you think is missing right now in terms of tooling out there to make your life easier?

Ioannis Foukarakis 45:24
I think there’s a lot of tools out there right now, and you have a lot of freedom with most of these tools. Actually, especially for modern projects and how they’re maintained, there’s a lot of freedom in how you structure your project. So there are emerging practices right now. Some of these things have been shown in the past, but you know, we have to adapt them to the new tooling, etc. So solid definitions of data quality, and solid examples of how to measure it, is one thing. And then the other challenge I see is that most of the tools focus on specific parts of data quality. So, for example, you might have a tool that focuses on identifying missing values, but you might not be able to reuse this tool in order to find whether the distribution of the values changes over time, you will need a different tool for that. So it’s becoming challenging, you know, it’s a lot of tools that need to fit together.

Kostas Pardalis 46:51
Yeah, it makes sense. What’s interesting, I feel like if you think about it, data quality itself requires a lot of processing on its own, right? Like, there’s analytics that you need to do on the data just to check these things. It’s interesting. So, okay, one last question from me, and then I’ll hand the microphone back to Eric. What is one thing that has happened in the past couple of months, or a year or whatever, in your space, in data engineering, that really got you excited for the future? And you can include RudderStack...

Ioannis Foukarakis 47:33
in your head.

Kostas Pardalis 47:37
So can be a tool, it can be like a new technology, it can be like in practice, like, whatever.

Ioannis Foukarakis 47:45
So I really like how dbt is maturing over time. That’s one thing. And what I really liked is the ecosystem around dataframes and how you can use them. I haven’t used it in production, just, you know, personal experimentation, but it seems like a really interesting approach.

Kostas Pardalis 48:15
Cool, that’s awesome. Eric, all yours again.

Eric Dodds 48:18
All right. Well, I’m actually going to conclude with a question for both of you, and that is: are there any games that you still play? Either on the PC or, you know, the console, or even on your phone? Candy Crush doesn’t count.

Kostas Pardalis 48:36
Okay, do you want to go first?

Ioannis Foukarakis 48:39
My kid owns my consoles, so I don’t have a lot of time for games. But usually, it’s me helping him with some of the games. We really enjoy playing games together, so we have Mario Party and Mario Kart and all these things. But lately, he’s been really excited about another game called Subnautica. It’s a survival game, and he likes exploring, seeing the world.

Kostas Pardalis 49:18
very cool. And me. Unfortunately, I’m not allowed to get close to computer games. So

Eric Dodds 49:28
That’s because of the consequences you’ve experienced in the recent past.

Kostas Pardalis 49:34
Yeah, I don’t know. I hope in the future I’ll be able to play again, to be honest. By the way, one of the things that I noticed at some point is, okay, we used to play Quake Arena, for example, right? Back then, when we were in our early 20s or late teens or whatever, we were doing some pretty amazing stuff. I remember, especially, some folks that were playing with us, it was so hard to beat them, how fast they were, all that stuff. And then I remember trying to play one of these games again after a couple of years, and I felt so old, like there’s zero chance of being able to compete.

Eric Dodds 50:28
Faster at that age. Yeah.

Kostas Pardalis 50:32
So I had a friend, I mean, another guy who was telling me that he and his mates, they come back from work and they get on Xbox Live, like a gang of old dudes. They get on one of these first-person shooters online, they know that it’s going to be a massacre, right? They’re all going to die, they are not going to enjoy the game itself, but they figured out a way to enjoy not enjoying the game, by just being all together, making fun, having a beer, getting on the game, and getting massacred by kids. So I can see myself probably being one of these guys one day, but we’ll see.

Eric Dodds 51:15
I love it. Well, thank you for sharing stories about Quake Arena and naming your characters after your professors, Yanni. Incredible story. Thank you so much for sharing, we learned a ton, you know, especially about data engineering, ML, and the influence of software development on data engineering. So thank you so much, and congrats again on winning the RudderStack transformations challenge.

Ioannis Foukarakis 51:40
Thanks for having me,

Eric Dodds 51:42
Kostas, what an awesome episode with Yanni. I mean, it’s clear that the big takeaway is that if you neglect your Quake Arena practice, those skills will atrophy over time and cause regrets for you. It actually made me think about Duke Nukem. Do you remember that? That was, again, a game where you had those friends who were just like, how, you know, how did you get so good at this? Yeah. It’s amazing.

Kostas Pardalis 52:21
It’s interesting, I mean, if you think about it, because we had this conversation with Yanni, I started remembering how we were, you know, playing games and stuff like that back then. And there were a couple of things in Quake Arena that, okay, first of all, it was crazy to see, with the railgun, the aim that some people had and how they could do headshots. That was crazy. I mean, I don’t know what kind of reflexes those are. I never managed to get to that level. But there were people that, when they entered the arena, you would just leave, because it didn’t make sense. It was almost like cheating, you know, and they were not cheating.

Eric Dodds 53:09
Yeah. And

Kostas Pardalis 53:10
usually the was the result of like, spending way too many hours like laying instead of studying

Eric Dodds 53:17
100% Yeah. Like an effect on your don’t Yeah, I mean, you’re talking about people who would like take the mouse apart and like clean the ball and like clean the mousepad before the game, you know, because they had like a very

Kostas Pardalis 53:33
bull, the bull the bull, like something that doesn’t exist anymore. Okay. Yeah,

Eric Dodds 53:38
exactly. Yeah. But it’s super important. Because like, you know, once you got really good, you can tell if the ball got dirty, like

Kostas Pardalis 53:48
and, yeah, measuring the ping to the server like because

Eric Dodds 53:52
Yeah, well, yeah, sounds good.

Kostas Pardalis 53:55
The other thing that I think is a testament to human creativity here is that there were these things going on, like the rocket jump, right? Which takes skill, but with the default settings you couldn’t really do it, because you were actually blowing yourself up, right? But we were changing the settings so you could use rocket jumping. And that completely changes the way you play, right? So actually, it’s really interesting to see how people were not just playing, but also innovating on top of the game to make it, like, a new game, right?

Eric Dodds 54:36
100%. I think that’s actually a really good... you know, that was really fun to talk about when we think about the episode and talking with Yanni, you know, who now works as a data engineer at Mattermost, who does really interesting work around super high security team collaboration, you know, for the Air Force and for Bank of America and other huge companies. He’s a systems thinker, right? He breaks down systems. I mean, he studied electrical engineering. And we got a really interesting view of his arc, going from electrical engineering, to backend software development, to ML engineering, and then data engineering, and hearing about that story was absolutely fascinating. But it’s true, I mean, it sounds funny, but the way that you talked with him about trying to break down the Quake Arena game and, like, execute that, you know, during class and other things like that, it was a bunch of really smart, creative people solving a systems problem, right? And so that’s really, really cool to me, to hear his story. And I think for anyone who’s interested in sort of transitioning between different disciplines and taking the best of one discipline with you to the next one, this is a really great episode.

Kostas Pardalis 55:59
Oh yeah, and to me, what Yanni gave, I think, is a very pragmatic description of how the fundamentals, in the end, do not change. I think he mentioned that a couple of times, how we go through similar cycles, and things that we were doing in the past we do again today, and all these things. And that’s not actually a bad thing, it’s a good thing. Innovation doesn’t mean throwing away completely what was happening in the past and building a completely different thing. It’s much more, let’s say, iterative in a way. And there are fundamentals that remain there no matter what, some things cannot change, the fundamentals are there. And so, investing time in learning these fundamentals and enjoying working with these fundamentals, I think, is probably the most important thing that someone can do in their career. It doesn’t matter, if you have them, you can go through software engineering, backend engineering, front end engineering, ML, to data engineering, and whatever is next. So, I think it’s a great episode for anyone who wants to learn about that.

Eric Dodds 57:16
I agree. Well, thank you for joining us. Definitely subscribe if you haven’t, tell a friend, give us feedback, head to the website, fill out the form, send us an email, and actually, send an email to Brooks at datastackshow.com. He’ll respond faster than you’d expect. And we will catch you on the next one. We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe to your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.