Episode 29:

The Present and Future of Data Engineering with Joe Reis and Matthew Housley from Ternary Data

March 17, 2021

On this week’s episode of The Data Stack Show, Eric and Kostas are joined by Matthew Housley, CTO, and Joe Reis, CEO and co-founder of Ternary Data. These self-described “recovering data scientists” focus on teaching skills to build a solid foundation for organizations to work with their data.

Notes:

Highlights from this week’s episode include:

  • Joe and Matt’s background and expertise (2:44)
  • Common threads and trends in the data sphere (9:39)
  • Differences and commonalities between startups and enterprises and the way they deal with data (18:28)
  • Discussing how the role of data engineering has evolved over the years and what it might morph into in the near future (27:52)
  • The ideal data infrastructure and what future shifts excite them (39:52)
  • How ML is shaping the data space (44:30)
  • The state of real time (49:56)

 

The Data Stack Show is a weekly podcast powered by RudderStack. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds  00:06

Welcome to The Data Stack Show where we talk with data engineers, data teams, data scientists, and the teams and people consuming data products. I’m Eric Dodds.

Kostas Pardalis  00:16

And I’m Kostas Pardalis. Join us each week as we explore the world of data and meet the people shaping it.

Eric Dodds  00:26

We have guests on the show today who I think have a pretty broad view of the data space. It is Joe and Matthew from Ternary Data. And they do services and training for data engineering, and all sorts of different data work. And they’re really interesting guys. My question … I’ll be interested to see what common things they see among the companies that they work with. I know that’s pretty simple. But just having done consulting myself before, you sort of notice patterns, you know, getting to look at the way that lots of different companies are trying to solve the same problem. So I want to see what types of things they see in their work that are common across companies trying to do data engineering stuff. How about you, Kostas?

Kostas Pardalis  01:13

Actually, I think this time, I will mainly want to hear the same things as you do. What I would add to this is, by the way, I think it’s the first time that we have people that are coming from a consulting company. So I think we should explore the fact that they have this kind of exposure to many different companies. And they have seen many different ways of implementing data engineering and analytics and machine learning. And also, I’m very interested to see from their point of view, the evolution of the space and this industry, because we are still at the beginning of defining this data related industry. But it’s not like we didn’t work with data in the past. And I think they will have a very unique perspective on that on how things have progressed in the past 10 to 20 years. And that’s super interesting for me to hear about.

Eric Dodds  02:02

Totally agree. All right, well, let’s go talk with Joe and Matthew from Ternary Data.

Kostas Pardalis  02:07

Let’s do it.

Eric Dodds  02:09

Alright, we have Joe Reis and Matthew Housley from Ternary Data on the show. Gentlemen, thank you so much for joining us.

Joe Reis  02:17

Hey, what’s up, guys?

Matthew Housley  02:17

Thanks for having us.

Eric Dodds  02:20

All right, well, so many interesting things to talk about. And I can go ahead and tell you that, you know, having done consulting in the data space myself and knowing all the different types of things you see, we’re just gonna have so many great things for our audience to hear. But why don’t you tell us a little bit about yourselves, just a quick background? And then tell us about Ternary Data and what you do?

Joe Reis  02:44

Yeah, I’ll go first. So this is Joe. And my background’s always been in data of some sort. I’ve been in the data space since the early 2000s. And, you know, I guess the work I was doing back then would now be called data science. And, you know, had always had a fascination with machine learning. And, you know, got into that around like 2009, 2010, I would think. I started, you know, delving into that, especially, you know, the availability of cloud and, you know, those sorts of resources. Started at an auto ML company in 2012 and then quickly realized that a lot of the problems facing machine learning had nothing to do with algorithms. Even early on then I realized a lot of the problems to make machine learning successful in production had to do with proper data architectures and data engineering. And so, you know, over the years, I’ve been on a crusade to, you know, help companies get more value from the data by helping them build solid data foundations and data architectures.

Matthew Housley  03:45

And I’m Matthew Housley. My long-term background is actually in academia. So I have a PhD in math, more the pure side of math. I suspect that actually resonates with a lot of people. I find a lot of data people come out of the math world one way or another, including Joe. You have a bachelor’s degree in math, I believe.

Joe Reis  04:03

Yep.

Matthew Housley  04:04

Yeah. So eventually, I had a friend who was more on the statistical side of math, we had done, written, worked on some papers together. And he recruited me to work as a data scientist. So really appreciate him doing that, taking a chance on me. And I started the job. And as a junior data scientist, I learned about a lot of the core tools, really enhanced my Python skills by doing things with Pandas, working with data on a laptop. And at some point, I realized that the laptop-based workflow was extremely limiting. And so it just didn’t scale. And then the other data tools we had available in our organization were extremely slow. And also, in a sense, didn’t scale. They were just too overloaded. And then my focus gradually began to shift toward data engineering. So how can we build large efficient data systems? How can we basically provide systems that will be a force multiplier for a data scientist so they can get off of their laptops and these kind of very circumscribed tools into tools that can handle terabytes and even potentially petabytes of data. This kind of intersected with the company I was working at. It was looking at a cloud migration. And I ran some early projects on GCP, and then worked on a project on AWS, and realized that there was this huge skills deficit around cloud data technologies. So you could do amazing things, even with EMR on AWS, but you had to have the right skills to make that possible. And so around this time, I think it was 2017, Joe and I met. And so we started mapping out the possibility of a company that would focus mostly on data engineering. The end goal would kind of be the same. In other words, our goal is to enable machine learning and data science, but from more of a foundational level with a very heavy cloud focus.

Eric Dodds  05:54

Very cool. And it’s actually interesting, your comment about a background in mathematics that has proven true, at least across our guests on the show. We’ve had many guests who have a background in mathematics. So that’s at least a little bit of anecdotal data to reinforce your hypothesis.

Kostas Pardalis  06:14

Yeah, actually, in mathematics and physics, like, yeah, it’s like these two types of people like they’re thriving in data science, I think.

Joe Reis  06:23

Hey, Matt, didn’t you get your master’s in physics?

Matthew Housley  06:26

I actually did. I started out in physics. And again, anecdotally, my observation is that you have your like applied math, statistical people, and they tend to go really deep into data science and machine learning. And, again, anecdotally, I’ve just seen a number of my friends do this, but people who are more on the pure math side, so like proving theorems, and working in areas like algebraic geometry, number theory, representation theory, tend to drift into engineering, just because engineering, I don’t know, I think it’s the problem solving aspect. Maybe there’s a lot in common between debugging a distributed system and trying to prove some theorem about linear algebra. I don’t really have a good theory. It’s all anecdotal.

Joe Reis  07:05

I think that’s correct. I mean, that was kind of my path. I mean, I was an applied mathematician and was into, you know, analytics, and, you know, more kind of real world situations. So, yeah, yeah.

Eric Dodds  07:16

Okay, tons of data stuff, but I’m going to indulge myself, which I do occasionally on the show, Matthew, by just leveraging some of your expertise to help me answer my own children’s questions. So last week, my four-year-old, who you know, is counting and, you know, trying to count higher and higher, etc. And we’re driving in the car the other day, and he just said, Hey, Dad, do numbers go on forever? So I want the quick PhD in pure math answer to how do I respond to my four-year-old son?

Matthew Housley  07:52

That is a good question, I might be able to find some good YouTube videos for you, actually. So numbers do go on forever. And Joe is probably familiar with this, too. There are different types of infinity that you have to worry about as well. So the type that we deal with as children is the countable type of infinity, where I can get to any number eventually. In other words, pick a number out of the bag, I count long enough, and I’ll get there. But there are other types of infinity that are not countable. So for example, the real numbers don’t have that property. They’re a larger type of cardinality. And yeah, I’m not doing a good job of explaining this, but I’ll try to point to some resources that might be helpful.
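
For readers who want the statement Matthew is gesturing at, here is a short sketch in standard notation. It is not from the episode; it simply writes down the textbook facts about countable and uncountable sets (Cantor's theorem), with \aleph_0 denoting the size of the counting numbers.

```latex
% Countable infinity: the counting numbers can be listed one by one,
% so any particular number is reached after finitely many steps.
\[ |\mathbb{N}| = \aleph_0 \]

% Cantor's theorem: every set is strictly smaller than its power set.
\[ |X| < |\mathcal{P}(X)| \]

% The real numbers have the cardinality of the power set of N,
% so they cannot be put in a list: they are uncountable.
\[ |\mathbb{R}| = 2^{\aleph_0} > \aleph_0 \]
```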

Eric Dodds  08:31

I am super appreciative. And I just learned something really cool. That will surely take me down a Google rabbit hole later this evening.

Joe Reis  08:39

Give yourself a few hours on that one.

Eric Dodds  08:42

Yep, multiple infinities. Light evening reading with your four-year-old. Cool. Well, I know Kostas has a ton of questions. I’ll kick us off, though. So I mentioned this in the intro. So you run a services business where you do all sorts of things, helping companies make better use of their data, get their data cleaned up, you know, just the variety of things that go with data engineering. Matthew, as you said, to help data become a force multiplier. You have a wide purview of different types of companies doing different things with data, and so I’m really interested in some of the common threads you see. You know, a lot of the time we’ll talk about really specific deep subjects with someone working on, you know, say data science inside of a particular company and dealing with a particular type of data. But I’m really interested in your sort of broad range of view, working with a lot of different types of companies. So what are you seeing out there?

Joe Reis  09:39

I think it’s a good question. Okay, so let me caveat this with some anecdotes where I, you know, we obviously work with a lot of companies. And I also talk to a lot of data professionals around the world on a nearly weekly basis and the things that I’m finding are the threads that we’re seeing are actually pretty common everywhere. And so what are those threads? Mainly machine learning’s really damn hard for the reasons that we kind of alluded to earlier in our intro where, you know, it’s easy to get to maybe the 70% of machine learning where you spun up a Jupyter Notebook on your laptop, and you can check that box, that might be a success. But then when it comes to, I think, rolling out, you know, machine learning, and I would also add analytics into this, to the broader organization, this is where a lot of the challenges start. So part of it is a technology issue, I would say a large part of it’s also an organization issue. Right? So if you don’t have the company on board with these, you know, digital or data transformation projects, it’s going to be incredibly difficult to make progress. And the types of organizations where we see this, they typically tend to be, well, if it’s not data first, i.e. if you haven’t incorporated data into your processes, or a product from the get go, retrofitting, you know, data science, more modern data techniques is actually I would say it ranges from fairly difficult to, like, not gonna happen. And what do you have to say on that, Matt? Am I off base there?

Matthew Housley  11:18

No, I completely agree with that part. Yeah. And I think another big theme, from my perspective and Joe you should weigh in on this as well. But a lot of our clients that are still dealing with on-prem systems are hitting a wall there for various reasons. So one possible type of wall is that they’re using an older legacy database, that’s a high quality database, but they just run into scalability limits at some point. So no matter how big your Oracle license is, at some point, you’re probably going to hit a wall on your Oracle license, no matter how many servers you have, in some kind of a cluster, you’re probably gonna hit a wall on that, at some point, you might have a legacy Teradata system, you re-up to a bigger system, and you still hit a wall on that. I think also, we’ve seen that Hadoop has become a disappointment for many organizations on prem. Because again, you run into those same fundamental issues. Plus, you need really heavy duty expensive engineering resources to run that cluster. So Hadoop turned out to be fantastic if you were the scale of like a Yahoo or a Facebook, and you could just build a massive cluster. And you could have these highly, highly proficient engineers, and you could scale to, you know, 5,000 nodes, and serve the data needs of the whole company. But if you’re a lot smaller, and you’re not specialized in tech, then that is going to become a real problem for you at some point. And I think that is the big driving force behind cloud migrations, moving away from some of those limits, and having limitless scaling as a possibility in the future. I think the other big thing for me is that the data issues that organizations run into, beyond their hardware and technology, that hinder data science come down to really basic foundational problems like data quality, really complex ETL, it takes a long, long time to deploy fixes to data quality issues. 
And the interesting thing is your data scientists are smart, they’ll often find these data quality issues very quickly as they’re working on a new project. But it might take six months to a year to actually get the fixes deployed with the somewhat non-agile legacy approach to data pipelines that many organizations have. Another big theme is just getting data in and out. So tools like Fivetran, for example, that connect from A to B are just blowing up right now, because that turns out to be a very hard problem, even though it looks simple.
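
To make the data quality point concrete, here is a minimal sketch of the kind of basic check a data team might run before data reaches a data scientist. It is not from the episode: the function name `check_quality`, the `QualityReport` type, and the sample `orders` rows are all hypothetical, and real pipelines would typically use a dedicated validation tool rather than hand-rolled code.

```python
from dataclasses import dataclass


@dataclass
class QualityReport:
    total_rows: int
    null_counts: dict      # required field -> number of rows missing it
    duplicate_keys: list   # key values that appear more than once


def check_quality(rows, key, required):
    """Count missing required fields and find duplicated key values."""
    null_counts = {field: 0 for field in required}
    seen = set()
    duplicates = []
    for row in rows:
        for field in required:
            # Treat None and empty strings as missing values.
            if row.get(field) in (None, ""):
                null_counts[field] += 1
        k = row.get(key)
        if k in seen and k not in duplicates:
            duplicates.append(k)
        seen.add(k)
    return QualityReport(len(rows), null_counts, duplicates)


# Hypothetical sample data with one blank customer, one missing amount,
# and a repeated order_id.
orders = [
    {"order_id": 1, "customer": "a", "amount": 10.0},
    {"order_id": 2, "customer": "", "amount": 5.5},
    {"order_id": 2, "customer": "b", "amount": None},
]
report = check_quality(orders, key="order_id", required=["customer", "amount"])
print(report)
```

Running a check like this on every load, and failing loudly, is one way to surface problems in minutes instead of waiting for someone downstream to notice.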

Joe Reis  13:40

And I’d add RudderStack to that mix too. And the other thing I would add too is, you know, what we are noticing, and this is actually influencing our business model to a pretty big degree now, is there’s a big skills gap, right, there’s a skills gap internally with companies with respect to data engineering, and best practices with the cloud. And there’s also a talent gap. So if you want to hire data engineers, you know, unless you have, you know, the cachet of one of the bigger tech companies, or you’re doing something really innovative, it’s really hard to find data engineers. And so we also find, you know, there’s a trend towards, you know, easy to implement solutions and to couple that with training, right. So actually our business model, we actually got out of the sort of butt in chair hours, typical services engagements, because what we realized is with a lot of our clients, what they need is actually if they have a data team, the data team needs skills. They don’t need somebody to come and implement it for them. They actually need somebody to help them implement and coach them with specific technologies and best practices and paradigms because at the end that makes the implementation a lot stickier with best practices and so forth. So that’s what we found, you know, sort of our secret sauce. You know, as much as we can we actually don’t touch a keyboard. We teach other people how to level up their skills and become, you know, awesome data engineers. And so we found that that’s a big differentiator with what Ternary does versus other companies and a lot of our partners like this approach, because it makes, you know, obviously, their products a lot stickier for the client.

Eric Dodds  15:15

Yeah, I mean, that makes total sense. Because it’s easy to think about data engineering as something that kind of has like a defined start and end point, you know, along the lines of implementing a new technology, right, we’re gonna migrate to a new warehouse, right? So you get the new warehouse, and you do the migration, and then great, you’re running on the new warehouse. But in reality, data engineering is something, I mean, we see this, you know, just talking with people on the show, it’s something that’s a constant pursuit. And so I think describing that as the need is more around skills, I think, is a really good point. And we see that all the time. Because it’s not something you’re ever really done with, right? I mean, you can complete projects, and you can build infrastructure. But as organizations grow and change, the needs around data grow and change, right, and the types of data and formats and, you know, delivery, and all those sorts of things are dynamic as an organization grows and changes. So that’s really interesting to hear.

Joe Reis  16:20

For sure. And you know, having the skills is paramount, too. Because when you’re evaluating a new tech stack, right, the number of data tools keeps growing every year. So you know, in fact, Matt and I were looking at these charts of data tools in 2012 versus today. And, you know, you could count the number of data tools in 2012. I don’t know that you could actually feasibly count the number of data tools today. And I think it also goes to, you know, just the ability of a person to keep up on best practices and modern tooling on, you know, the best tools out there, like that has just become exponentially more complicated. And so again, like even to evaluate a new tech stack means you constantly have to keep staying on top of stuff. Because with the number of tools coming out, there’s going to be new approaches, you know, to data and yeah, like, I think to your point: nothing’s ever static in this industry. If you’re static, you know, you might do that to your own detriment.

Eric Dodds  17:16

Sure, sure, absolutely, we were actually just talking to someone earlier today. And they made a really interesting point. They said that there’s a big gap, you know, a lot of the, you know, even data tools will sort of paint a picture with their marketing messaging that doesn’t necessarily reflect how much work it actually takes to get to the end destination, right? It’s like install this and then all of a sudden, you’ll have XYZ result. And in reality, it’s even worse, even if you get the tooling, like making it all work is just a gigantic effort. Well, one more question, then I’m going to hand it over to Kostas. So one more question on my end, because I’ve been monopolizing here. So we talked about the common threads, I’m interested to hear about any differences you see across companies, and maybe you don’t, but I think about things like are there sort of challenges that are different maybe among different types of business models or particular challenges you see at companies of a certain type of scale, you know, maybe startup versus enterprise? Are there any sort of differences you see across the work you do, because you work with such a wide variety of companies?

Matthew Housley  18:29

It’s funny, a lot of the differences we see are technology differences that are reflections of underlying cultural issues. So I think at this point, we do see a lot of companies that have struggled with data, but at least understand that data is really valuable, and so are willing to make the investment, they maybe need a little bit of direction about where the wind is blowing, and like where to try to make those investments. Whereas others see data as just this really expensive hole that they dump money into and are very stingy about what they put into it in terms of money, obviously, but the technology and the people as well. And they kind of wonder, okay, why is my data terrible? Culturally there’s a disconnect between their core business maybe, and the fact that data can help them. They just don’t quite understand at the top level sometimes.

Joe Reis  19:18

Yeah, it’s more reflective of … are you guys familiar with Conway’s Law? So you are? Okay, cool. Yeah. So I mean …

Eric Dodds  19:26

It may be good to just do a brief overview, just a brief run through for our listeners.

Joe Reis  19:30

Yeah. So for, for the listeners out there who don’t know what Conway’s Law is. Conway’s Law basically says that an organization will develop modes of communication based upon how the organization communicates, right? So they’ll build systems around how the organization communicates. So if you have a very siloed organization, your systems will represent silos, right. And if you’re a very open communication format, then you’ll develop systems that work accordingly. And so what we find with respect to data, especially, is data is different than technology in a few areas, right. So, you know, application development, for example, that tends to be focused on particular use cases. But you know, quite a few departments in a company use data, right? Whether it’s reports that come out of the ERP system, or you know, any number of things. And, as well, but when technology fails, you tend to notice this, because you’re, you know, maybe your application stops working, right. And it’s pretty obvious or you like, if you’re maybe developing an application and the tests break, right, and so you have a pretty good understanding of that. Data is a much different story, where data issues may persist for months or years, and nobody knows the difference. Right? So that’s a big issue. And I would say that when you start hearing things like, oh, well, as long as the data is directionally correct, that’s a pretty big red flag, that, you know, data needs to, should be addressed. Not to say it will be, but it should be. And so with that said, it tends to be the companies that I think are investing heavily into, you know, technology, if they need to transform digital transformation, because inevitably, data transformation and data value happens from those sorts of endeavors, or we tend to find this when, you know, companies are not trying to transform at all that tends to be where data kind of goes to die.

Eric Dodds

Yeah, directional is, in the data world, kind of like the term, like seasonal in marketing, you know, it’s like, well, this is weird, or this doesn’t look right. And it’s like, well, it’s seasonality, you know, it’s, it’s just seasonal, right? And in data, it’s kind of like a catch-all for like why things aren’t right. And, it’s funny, thinking about the word directional, like you said, I mean, the data is directionally correct. Right. And that usually means that there’s bigger problems under the hood. All right, Kostas, I’ve been monopolizing.

Kostas Pardalis  22:04

You did well, Eric. It was a very interesting conversation so far. So guys, a quick question. I noticed that you have your journeys, like you started from an academic background like science, mathematics, physics, then you went to data science and from there to data engineering. Can you take us through this journey and what you saw out there as data scientists that made you realize that you want to focus more on data engineering, and give us some examples of that?

Matthew Housley  22:41

Yeah, I could start out. So I think I was raised, like a lot of data scientists, on tools that run on a laptop, very heavy focus on Python, on R, and developed some Pandas skills, some R skills in terms of being able to analyze data frames, and run a Teradata query. That would take a while to run because the system was quite overloaded, I would download the data, I would load it into Pandas and then I would discover that I needed really a different sample of data. So I’d go back and run another Teradata query. And then I would try to transfer some of my workflow directly into SQL just to run on Teradata so I didn’t have to go jump through quite so many hoops to get from A to B. And that was super slow as my queries began to scale up. We also had a Hadoop system that had more event-oriented data. I tried to run a query on there, and it would take like three to five minutes, at best. And so the turn time, the workflow was just extremely slow, like the iteration time to try to run new queries. And then if I needed to do something beyond a SQL query in Hive, then I further had to download that data to my laptop, oh, it’s too large. Okay, go sample it. Pull it into Pandas again, and try to do some analytics that way. So I think a lot of my experience comes back to this cliche that data scientists spend something like 75% of their time, just trying to pull the data, trying to acquire it, trying to do some basic filtering, in order to attempt to do data science, like the first steps. And at some point, I realized that you had these tools in the cloud, like Elastic MapReduce, like Redshift, and Snowflake, and at some point, BigQuery. And you could dynamically scale up to a huge number of nodes. Yes, it would cost some money, but you’re only paying when these tools were actually turned on. And so I think that’s what really turned me on to the idea of doing data engineering in the cloud. Suddenly, we just had the scalability and resources of a much larger company at our disposal. 
And at some point, I started using Databricks as well. And now you can kind of transpose those laptop-oriented workflows into a data frame environment that was much, much more scalable and much faster. And so given that so much of my time was spent on just trying to address these core issues, I started to have this realization that deploying these cloud tools could speed up that workflow dramatically. And then at some point, I started to become kind of the point person for the teams I was on to deal with these kinds of issues, deploying Databricks and training people how to use it, and enhancing people’s SQL skills as well.
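
The "it's too large, go sample it" step Matthew describes can be sketched in a few lines. This is an illustration only, not his actual workflow: it uses reservoir sampling (Algorithm R) to draw a fixed-size uniform sample from a stream that never has to fit in laptop memory, and the `event_stream` generator is a hypothetical stand-in for rows arriving from a query export.

```python
import random


def reservoir_sample(rows, k, seed=None):
    """Return k rows chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = row
    return sample


def event_stream(n):
    """Hypothetical stand-in for a large export read row by row."""
    for i in range(n):
        yield {"event_id": i, "value": i % 7}


sample = reservoir_sample(event_stream(100_000), k=1000, seed=42)
print(len(sample))
```

Only the 1,000 sampled rows are ever held in memory, which is what makes the laptop workflow possible at all; it does nothing for the slow query behind it, which is the problem the cloud tools addressed.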

Joe Reis  25:32

Yeah, and I think, you know, on my end, you know, when I was getting into the ML space, there wasn’t, I mean, there was a handful of libraries that, you know, enable, you know, made ML simple, but there was nothing in the way of proper tooling, right? I mean, DevOps was still sort of being figured out in real time by a lot of companies. And so we, you know, in my case is, you know, I think it was more just having to figure things out from the ground up, because there wasn’t a playbook on how to do whatever you call data engineering now, and oh, and to some extent, ML engineering as well, right. So a lot of this stuff was just kind of having to make it up as you go along. And so, with that in mind, I think it had always been a mission to, I guess, to make things better, or at least hope to try and make the world simpler, just because I felt it was horribly complicated. And then I, you know, it was interesting, because around that time of like, you know, deep learning becoming the hot new thing, it must have been, like, 2014 or 2015. And then, you know, a lot of my data science friends, you know, and acquaintances are all asking, so why don’t you want to get, you know, why are you calling yourself a recovering data scientist right now? Like, surely you must be crazy. And I was like, well, it’ll make sense when you’re older. Cuz, I mean, I’d seen a lot, I’d gotten a sneak peek into the problems, you know, and so it just made a lot of sense. Why, you know, as Matt indicated, you know, developing Jupyter Notebooks was, I mean, it’s great, you know, knock yourself out, but, you know, to make this stuff work in production, there’s just a lot more work that needs to be done. So that, you know, it felt like, at least when I was getting into data engineering kind of around, you know, the early to mid 2010s, it wasn’t really a field then, either, right? I think my titles at the time were like software engineer, because there wasn’t, it wasn’t even a title for data engineer. 
Even though we were doing data engineering things. And so you know, I think, yeah, those experiences informed it. Yeah.

Kostas Pardalis  27:30

Yeah. So guys, I mean, we’re talking a lot about data engineers, but what makes an engineer or a software developer, a data engineer? And the question has actually two parts, one, in terms of like, the skill set, like what kind of skills someone has? And also, what is their role inside the organization? What does the data engineer do?

Joe Reis  27:52

I think the role of a data engineer is to help take the raw ingredients of data and ingest those, process them and then make them useful for analytics and machine learning? So I think if I were to say what the role is, in a nutshell, I would think that that’s it. What do you think, Matt?

Matthew Housley  28:13

Yeah, I’ll comment as well. I agree with Joe on that part. And I think in the last five years, there’s been a huge shift in expectations of what a data engineer’s role should be. Five years ago, 2016, you would see a lot of articles talking about how if you wanted to make a lot of money, you should go learn Hadoop, like low-level Hadoop, learn how to manage a cluster, learn about installing the software, learn about creating data pipelines, raw MapReduce jobs, maybe jump into Spark, that was the way to like be a very competent data engineer. I think in 2021, the emphasis has shifted much more towards stitching together a lot of pre-built pieces. So if you are using Spark, it might be something like Databricks, or EMR or Dataproc on Google Cloud. And yes, you’ll need to do maybe some low-level tweaking, but you’re not going to spend time creating a cluster and managing hardware and these kinds of details that used to be a big part of your job. You’ll also probably use a lot of completely off-the-shelf tools that are turnkey, you might use Snowflake or BigQuery. And you might orchestrate those tools, using something like Apache Airflow to make them work inside of a larger pipeline, get data into cloud storage, pull it back out, do interesting things with it. That to me kind of distills the skill set and the role but you asked as well about the organizational role, I think, Kostas.
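
The "stitching together" and orchestration idea can be illustrated without any particular product. The sketch below is not Airflow code and does not use its API; it is a toy, standard-library stand-in that declares tasks and their dependencies and runs them in dependency order, which is the core idea an orchestrator provides. The task names and the in-memory `state` dict are hypothetical placeholders for real ingestion, storage, and warehouse steps.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def ingest(state):
    # Placeholder for pulling raw records from a source system.
    state["raw"] = [" 3", "4 ", "x", "5"]


def store(state):
    # Placeholder for landing the raw data in cloud storage.
    state["stored"] = list(state["raw"])


def transform(state):
    # Keep only values that parse as integers.
    state["clean"] = [int(v) for v in state["stored"] if v.strip().isdigit()]


def publish(state):
    # Placeholder for loading a result into a warehouse table.
    state["total"] = sum(state["clean"])


TASKS = {"ingest": ingest, "store": store, "transform": transform, "publish": publish}

# Each task maps to the set of tasks that must finish before it runs.
DEPENDS_ON = {"store": {"ingest"}, "transform": {"store"}, "publish": {"transform"}}


def run_pipeline():
    state = {}
    order = list(TopologicalSorter(DEPENDS_ON).static_order())
    for name in order:
        TASKS[name](state)
    return order, state


order, state = run_pipeline()
print(order, state["total"])
```

A real orchestrator adds scheduling, retries, and monitoring on top of this ordering, which is a large part of why teams reach for an off-the-shelf tool instead of writing their own.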

Kostas Pardalis  29:37

Yeah, that makes sense. And I think from my point of view, like the way that I see the role, I think there is an interesting combination of tasks that in classic software engineering, you have the SRE, you have the DevOps, and then you have the software engineer, right. I think of a data engineer, okay, it might depend on the size of the company. And like the scale of the problems that they are solving. They pretty much have to do a little bit of everything. But as you said, one thing is like stitching things together, maintaining and making sure that the infrastructure is working, rewiring the infrastructure, because it requires a lot of changes, not just like maintaining something. And writing code. I don’t think that you can even like with tools that they are not supposed to need, let’s say, a lot of coding using something like Fivetran, I mean, still, someone needs at some point to create a dbt model, right, and interact with the data. It might be SQL, might be Python, or it might be all of these things together. So for me it’s like a very interesting problem, it’s a very interesting evolution of the role to be honest, because I don’t know, in my mind, at least, I don’t know if you agree. I mean, we started with a DB admin, back in the like late 90s, beginning of the 2000s. And we ended up like, today talking about data engineers.

Joe Reis  30:59

I think that a data engineer, like at the end of the day, like your job is to really, you know, as tools become simple, it’s still I think you need to have a really good idea of the data lifecycle. Right? So yep, ingestion, storage, processing, transformations, etc, I think to your point, so that doesn’t go away, I can actually see a day though this might be a bit heretical, I think the term data engineer may morph into something different. I mean, you’re seeing new buzzwords, like analytics engineer, I’m not saying buzzwords, like these are practical titles, right? So ML engineer and so forth. And so you know, and it’s gonna be a lot more fine grained, just like data scientists, right? Like that was kind of a catch all term where you had to be kind of good at everything. Like you’re kind of the crossfitter of data, like, yeah, you’re not really good at like anything in particular. But you’re amazing at everything. So I see data engineering sort of morphing into that because it is true. I mean, you point out the word DBA. I mean, I see data engineering job postings that are basically a DBA job, right? Or an ETL developer.

Kostas Pardalis  31:57

Yeah, by the way, now that you said that, how do you see the role of data engineering changing inside the organization based on the size of the organization? Do you see data engineers working in startups compared to big established companies? Do you see a difference there? Does the role evolve or change depending on the environment where you work?

Joe Reis  32:18

Yeah, definitely, like startups definitely tend to be a, you know, more of the quote unquote, full-stack data person. Right. So I mean, I don’t know if that ever disappears entirely, just because you’re resource constrained. And so whoever you hire is gonna have to basically figure a lot of stuff out. But yeah, as you get into, you know, I think more established companies, the role of a data engineer is a lot more defined. And I think that the nuances are more defined for that particular company, because again, a data engineer or data scientist, depending if you go to any of the FAANG companies, it’s all different and let alone like, all the millions of other companies across the country, right.

Kostas Pardalis  32:52

Yeah, how big usually are the teams of the engineers based on your experience?

Matthew Housley  32:58

That’s a good question. I think we’ve seen a lot of core data engineering teams of 10 to 20. I don’t know if we can expect that to fragment in the future. But I’m thinking of a couple of billion-dollar-a-year revenue companies that had teams in that kind of size range. And they were just responsible for building out and maintaining a lot of pipelines and interfacing with parts of the company across the organization to provide resources to them, basically. What are your thoughts, Joe?

Joe Reis  33:26

Yeah, I think it’s about right. And again, it just depends on the size of the company, right? And it gets … although other roles are split out. I mean, sometimes you’ll see a software engineer doing a lot of data engineering work, or, you know, a data scientist doing data engineering work. And that usually just means the titles have yet to settle. So, yeah, but again, it’s a weird thing where there’s cargo culting of things like job titles, right? You know, you just pick a job description from some other company that looks good: we’ll just take that one. Yeah, they might read them, sometimes they might not. I don’t know.

Kostas Pardalis  34:04

Makes sense. All right. So okay, I think we covered a lot around like people in organizations. I think it’s time to talk a little bit more about the technology. I think you mentioned earlier how much the technology landscape has changed from 2012, where everything was around Hadoop, and Spark was just starting until today. I mean, I think there’s a huge, very exponential growth in terms of the tools that are available out there. And I think you are the best people to talk about this evolution. Can you give us a little bit of your experience with how these tools have evolved, and some actual tools that you find as really, really important for the job of the data engineer?

Joe Reis  34:47

I think the one thing I noticed is a lot of the things that were popular back then, so you’re talking about your Hadoops, Sparks and whatnot. It’s interesting because the evolution with a lot of those tools is, you know, depending on the type of company and depending on your skill set, the data warehouse has come back into vogue, right? So a lot of things that you could do in Spark, I mean, you can also do in SQL using Snowflake, or BigQuery, or Redshift. And I recall conversations back in the day, like, oh, SQL’s dead, data warehouses are dead, you may as well just learn, you know, Python and Scala and call it a day. And I still think there’s a time and place for that discussion. But increasingly, the new generation of cloud data warehouses is extremely competitive with these older, quote, big data tools. Additionally, when you’re talking about streaming and things, I mean, that’s, I think, still a work in progress. But, you know, I would say streaming and data warehouses are two things that I’ve seen that are kind of taking a lot of attention from people. I would say data warehouses are a lot more understood than the streaming part, which we can get into in a bit. But what are your thoughts, Matt?

Matthew Housley  35:57

Yeah, I think it’s funny, a lot of these tools started out being targeted at developers. So for example, Hadoop: back in the day, when Hadoop started, if you wanted to write a data processing job, you were going to be writing Java code and writing MapReduce steps. What happened? Eventually, Hive came out, and now you could do all that in SQL, and it turned out a lot of the mindshare started shifting toward Hive because even really sophisticated data engineers didn’t want to be spending all their time writing MapReduce jobs. We’ve kind of also seen the evolution of a lot of more traditional big data tools into the data warehousing space. Maybe I shouldn’t say traditional, these aren’t that old, but it seems like nothing we use these days is particularly old. For example, I was using Databricks a couple years ago, and at the time, it was very clear that Databricks was shifting toward being almost a hybrid of the data warehouse and data lake models. Initially, the idea was I can take this raw, unstructured data and do just about anything with it. But over time, they shifted their focus, with Delta Lake, toward schema management, management of table changes, and things that data warehousing does more traditionally. Another tool that’s shifted in that direction is Imply. They started out being very, very focused on real time, and now they also advertise themselves as being able to serve this data warehousing need. And so it does seem like data warehousing and SQL both have made a huge comeback. The other really big technology shift is just in terms of how you purchase and deploy these technologies.
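The MapReduce-versus-SQL contrast Matt describes can be illustrated with a small sketch. This is not Hadoop or Hive code: the map and reduce steps are plain Python functions, SQLite stands in for the SQL engine, and the events data and column names are made up for the example.

```python
import sqlite3
from collections import defaultdict

# Made-up sample data: (user_id, amount) pairs.
events = [("alice", 3), ("bob", 5), ("alice", 4), ("carol", 1), ("bob", 2)]


def map_step(record):
    # Emit one (key, value) pair per input record.
    user_id, amount = record
    yield user_id, amount


def reduce_step(key, values):
    # Combine all values that share a key.
    return key, sum(values)


def run_mapreduce(records):
    """Hand-written version: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_step(record):
            groups[key].append(value)
    return dict(reduce_step(key, values) for key, values in groups.items())


def run_sql(records):
    """Declarative version: the same aggregation as one GROUP BY query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_id TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO events VALUES (?, ?)", records)
    rows = conn.execute(
        "SELECT user_id, SUM(amount) FROM events GROUP BY user_id"
    ).fetchall()
    conn.close()
    return dict(rows)


mapreduce_totals = run_mapreduce(events)
sql_totals = run_sql(events)
```

Both functions return the same totals; the SQL version just leaves the grouping and shuffling to the engine, which is roughly why mindshare moved toward it.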

Matthew Housley  37:32

I think back in 2015, go back further to 2012, the cloud was perceived as a toy for companies that weren’t big enough to have their own data centers. And I think in 2021, there’s this realization that it makes more sense to deploy your capital to the cloud and let someone else take care of a managed service for you be that Google BigQuery, or Databricks, managed open source or managed proprietary, and let your data engineers focus at a higher level, and let someone else take care of a lot of the behind the scenes details and tuning.

Joe Reis  38:08

Yeah, that’s a good point. Yeah, I would say the last five years especially has seen sort of the rise of trying to eliminate as much undifferentiated heavy lifting as possible, in the data stack. Whereas before, it was almost like how complicated can I make my stack. And then some companies wised up and found that maybe taking a more simplistic approach had value. And sure enough, those companies are now you know, in a lot of cases, unicorns or soon to be, and so that’s kind of cool.

Kostas Pardalis  38:36

Yeah, yeah, I think that’s also a big part of the success of Snowflake, to be honest. I mean, they managed to start as a data warehouse. Actually, it’s very interesting, because if you see the evolution of how they position the product and the company, I mean, if you see their diagrams, it looks like the data warehouse was their MVP in a way, which is very interesting, because, I mean, it’s a pretty complex thing to build, right? But today, if you go to their main website, they don’t even call it a data warehouse. They call it, like, the Data Cloud. And of course, their bet was on cloud, cloud and self-serve, right? Because back then, when they started, you think about Redshift, right? Redshift was on the cloud, but it was still a bit of a pain to manage. You still had many knobs that you had to play with in order to optimize it. And then at some point, you had to rescale your cluster, and that was a major pain, and you had to go through downtime. So I think Snowflake really reflects the evolution in this space. And I think it’s very interesting. So what kind of stack are you excited about? I mean, if you had to build an infrastructure today, what are the tools that you would choose, and also what tools do you really enjoy working with?

Joe Reis  39:55

I mean, we both do a lot of work in Snowflake and BigQuery from a data warehousing angle. I would say that those are two that we’re excited about, just because I think they’re both pleasant to work in. The things I would say, I don’t know, before I keep blabbing, Matt, what are you excited about?

Matthew Housley  40:11

Oh, no, I completely agree. It’s funny. I think a lot of data engineers still perceive data warehousing as very unsexy. It’s like, well, it’s just a data warehouse; it runs on SQL. But I think the exciting things about Snowflake and BigQuery are that you can just drop a couple petabytes of data in there if you want to, and you can be running these extraordinarily huge queries in short order. And so that means that if I am a large company, and I have, you know, petabytes of data on prem, the hard part is just shipping that data to the cloud. Once it’s up in the cloud, I can do these amazing things with that data. And then I can start hooking in other tools as it makes sense to do so. If I really need the power of Spark, I can plug that into Snowflake or BigQuery very quickly. So yeah, I find those are a pleasure to work with. And I find them both exciting because of the degree to which they can scale so easily.

Joe Reis  41:05

Yeah, I would say the thing I’m excited about is actually the MLOps tooling space. That’s fascinating to watch unfold in real time right now. I have no idea where it goes, honestly, but I don’t think anyone in the industry knows either. But that, to me, is fascinating, because, you know, the practices of data engineering, I think, have pushed the maturity of analytics for a lot of organizations. And simultaneously, there’s, you know, this undercurrent of people working in the ML tooling space. And so I’m very excited about that, I would almost say more so than even the stuff happening in data engineering. Snowflake and BigQuery are great; I consider those to be sort of the cool stuff in the present. The things I’m personally interested in are, you know, continuous learning, real time systems and how that impacts business. Matt and I were just having a chat about that the other day, actually, about the coolest stuff that could possibly happen when you have genuinely real time systems that are, you know, taking automated actions, and just what that means for businesses and how it impacts people.

Kostas Pardalis  42:08

Yeah, yeah, that’s really interesting. Actually, we had our previous episode, which was with someone from Tecton. And so we were, we were discussing about feature stores. And I mean, it was very useful for me, because finally, I figured out what the feature store is, or at least what we believe today. And it was amazing that you have like two or three years now, so many talks out there about feature stores. And yeah, I mean, like, the industry is still trying to figure out what this thing is. We all feel that we need it. But okay, how do you define it? How do you describe it to someone?

Joe Reis  42:46

Well, it’s interesting, because, you know, in January, Josh Tobin, who teaches a full stack deep learning course at Berkeley, did a talk at one of my meetups. He talked about the evaluation store, which is just a brand new concept that nobody had really heard of until he unveiled it there. And, you know, then who knows what kind of stores they’re gonna come up with next? I don’t know, or other technologies. And this may be entirely new ways of thinking about things, right? Because what I see right now in the ML space is, you know, people are taking the best practices they’ve seen from DevOps and data engineering and data ops, whatever that is, and trying to make sense of the landscape. But I’d be willing to bet that somebody comes along with a completely different approach to things. Because in the DevOps space, for example, people are still trying to, you know, make progress with that as well. It’s not like the DevOps space hasn’t done anything. It’s just 10 years ahead of where data is, basically.

Kostas Pardalis  43:41

Yeah, yeah, absolutely. Yeah. I think we’re just in the beginning of shaping this space. It’s going to be a couple of very exciting years, I think. So guys, one last question from me, and then I’ll hand the microphone to Eric. You mentioned at the beginning of our discussion that ML is hard. And at the same time, I think we talk a lot about ML. But in most cases, if you ask the people that are excited about it for some specific use cases of ML, it’s not that easy to come up with them. Can you share with us the most common use cases that you have seen of using data in an ML context? I mean in general, outside of the typical BI, which is extremely well defined; we all know what BI is about and how it is used. How is ML implemented today? What are the most common use cases that you have seen out there?

Joe Reis  44:30

I would say there’s certain tertiary problems with the business, right? So when you look at, I always sort of evaluate this from, you know, again, this is from a business that isn’t including ML in its product, where you’re doing maybe image recognition for an app or something like that, right? But if you’re a business, the things that you really care about are likely revenue related. So if you have enough history, forecasting, that’s a really big thing. Especially if you have a supply chain, you’re going to need a forecast, period. You operate without a forecast and a supply chain at your own risk. There’s that. And then there’s also customer retention, and churn and those sorts of things. So those tend to be the most immediate things that pop up, where it’s, you know, if I have customers, and I have some data, which customers are going to churn? And then, you know, how can I take an action to maybe prevent that from happening? Those tend to be the first order things we notice. And then obviously, if you’re e-comm, recommendation engines are a really big one. Do you have any other ideas, Matt?

Matthew Housley  45:30

Yeah, yeah, honestly, I think one of the most common applications I’ve seen, and this will probably resonate with you, Eric, is just very basic ad tech. And I don’t mean building some kind of advertising system, I just mean understanding who your customers are, who’s likely to buy from you, and feeding that data into Facebook or Google ads. And you would be surprised how often that’s not happening at all, where there are just hundreds of millions of dollars being burned without a lot of clarity on what’s going on, or certainly not a lot of feedback to improve that loop and the efficiency of that spend. Now, one thing that’s gonna be interesting to watch is this tightening of data privacy practices, and how effective these advertising practices are going to be in the future. I suspect some companies may have just missed the opportunity, and the window on some ad tech may be closing in the near future.

Joe Reis  46:24

But also add to that, I think that any operation … so ML is a really good fit when you have operations that are happening at such a high volume, or at such a fast rate, that it’s really difficult for humans to keep up, right? So anytime you have that, that’s a classic use case for implementing ML. I would say an anti-pattern that we often see is using machine learning on BI data. And you might ask yourself, hmm, that’s interesting, I make models from BI data all the time. And I would ask you, okay, so in your model, using your BI data, how much are your features correlated with your label? Right? I’d wager that in a lot of cases they are pretty highly correlated, because you can answer a lot of questions using just the data model that you already have, assuming that you’ve modeled it correctly. And I noticed this because I was at an AutoML startup where we dealt a lot with, you know, ingesting BI data. And over and over, when I sat back and thought about the problem they were trying to solve, it’s like, that’s a SQL statement, actually. Because it wasn’t automated in such a fashion, right, where you would get this feedback loop with your ML model, which in turn would help improve processes. What I see often is people will make these models and they’ll just be static models. But when you step back and look at it, they actually just made a report.
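Joe's "that's a SQL statement" point can be sketched with a toy example. Everything here is assumed for illustration: the customers table, the days_since_last_order column, and a definition of churn as more than 90 days without an order. When the label is defined directly from a column the BI model already has, a static model trained on that data can do little more than rediscover the WHERE clause.

```python
import sqlite3

# Hypothetical BI table: one row per customer with a precomputed recency column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER, days_since_last_order INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, 5), (2, 120), (3, 45), (4, 200)],
)

# Assumed business definition: a customer has churned after 90 inactive days.
# The "prediction" is therefore a filter on data the warehouse already holds.
churned_ids = [
    row[0]
    for row in conn.execute(
        "SELECT customer_id FROM customers "
        "WHERE days_since_last_order > 90 ORDER BY customer_id"
    )
]
conn.close()
```

A model starts to earn its keep only when its output feeds an automated loop that changes what happens next, which is the distinction Joe draws between a model and a report.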

Eric Dodds  47:50

Interesting. Yeah. Yeah. And going back really quickly on the advertising use case, Matthew. I agree, it’s, it is amazing how challenging it is to get. I mean, reporting. To your point, Joe, like, the reporting around sort of, like full funnel attribution and marketing involves so much more data engineering, and pipeline work than you would guess, right? I mean, it’s sort of like this horrible, gnarly, especially when you sort of are crossing different platforms, right, going from, you know, web to mobile, or, you know, marketing, web to product web, and you’re trying to tie all that stuff together. It’s just, it’s so hard to … I mean, it’s not like the technology doesn’t exist. But I think to the point that you’ve made multiple times, on the episode, crossing the organization in order to do that, and I mean, a lot of BI’s the same way, right. It’s just really, there’s so much involved in getting all of it together and getting all of it right.

Eric Dodds  49:03

We are close to time here. But I have one more question. Joe, you mentioned real time. So I would just love your perspective really quickly on the state of real time. It’s one of those marketing type terms where it’s used very liberally. And anyone who works in data engineering knows that, you know, we’re still in early innings, right? Like, it’s pretty, it’s certainly feasible and a lot of companies do interesting things with real time. But I think we’re still in early innings. I just love to hear from your perspective, when you see companies trying to do real time. What do you see on the ground? What are some of the current technologies that they’re using? And then what types of things do you see coming in the future that will be sort of the game changers as far as real time goes?

Joe Reis  49:56

Real time is most effective when you’re able to take automated actions against real time data. Right? So an example would be, you know, like IoT. That’s a classic example. Data is flowing in. And, you know, you’re going to use that data to do something, right. You can certainly store that data into a data warehouse or data lake for, you know, kind of after the fact analysis. I would say the state of like, real time analytics is a really interesting one. And Matt, I’d like your thoughts on this, too, we see a lot of companies wanting to do what they call real time analytics. But, you know, if you take an extreme example, say data shows up every millisecond, and it’s updating a chart, I guess our question is, what action are you going to take with that chart? Right?

Eric Dodds  50:47

So who’s just sitting there looking at the chart all day long?

Joe Reis  50:53

Right? Yeah. So the natural next question is, well, okay, this kind of goes back to the machine learning discussion. We’re talking about high volume, high velocity data, where humans can’t react at that speed, right? That’s where automation comes in. And so that, to me, I think that’s where real time is heading. There’s sort of a fascination, I think, like a gee whiz, oh, I can, you know, binge watch my business and watch everything in real time. I’m like, if you have to binge watch your business to that extent, you have a really shitty business, actually. So you shouldn’t have to pay that much attention to minutiae, right? It’s crazy. It’s like watching the hair grow on your arms or something. You know, that said, I think the future of it is, you’re just going to see machine learning be a lot more tightly coupled to real time systems. I think whenever continuous learning is figured out and working at scale, I see that as the next natural evolution. What are your thoughts, Matt?

Matthew Housley  51:52

I would say two things. So the way real time is marketed right now tends to be pretty problematic. I think it’s pitched as this universal solution: boil the ocean, replace all of your batch systems with real time. And companies get into it, and they discover that it’s very expensive, and it brings a whole host of new problems. In many cases, these companies were already struggling with their batch systems. And now the struggles just explode, and doing basic things like joins suddenly becomes very hard. Having said this, I think it’s a very promising domain. I think most companies of any size have some problem that they could solve better by using a real time system. And so my recommendation generally, when people are talking about real time, is like, okay, what problem are we interested in solving here? And this goes back to what Joe was saying: you want to couple it with some kind of automation. So let’s find a place where real time can actually have an impact, and deploy it there. And then we can look for the next use case. In terms of the technology and how that’s changed, I think the huge difference today from, say, five, six years ago is that I have these really nice off-the-shelf real time solutions that manage all the layers for me. Because in the past, deploying, like, a lambda architecture would require a huge team of very expensive, insanely good engineers. And now a lot of these data warehouse products actually have off-the-shelf, real-time web architecture built in. It’s just managed for you and taken care of. And so it’s fairly turnkey: if I can identify that appropriate problem, then I can start doing things in real time pretty quickly and start seeing value that way.
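As a rough illustration of coupling real time data with an automated action rather than a chart, the sketch below reacts to each reading as it arrives. The device names, the temperature limit, and the "shutdown" action are all hypothetical, and a list stands in for a live IoT feed.

```python
from typing import Iterable, List, Tuple

Reading = Tuple[str, float]  # (device id, temperature in Celsius)


def sensor_stream() -> Iterable[Reading]:
    # Stand-in for a live feed; a real stream would never end.
    yield from [
        ("pump-1", 61.0),
        ("pump-2", 97.5),
        ("pump-1", 64.2),
        ("pump-2", 99.1),
    ]


def automate(readings: Iterable[Reading], limit: float) -> List[str]:
    """React to each reading as it arrives instead of waiting for a report."""
    actions: List[str] = []
    stopped = set()
    for device, temperature in readings:
        # Act once per device the first time it crosses the limit.
        if temperature > limit and device not in stopped:
            stopped.add(device)
            actions.append(f"shutdown {device}")
    return actions


actions = automate(sensor_stream(), limit=95.0)
```

The useful question for any real time project is what goes in the body of that if statement; if nothing does, a batch report is probably enough.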

Joe Reis  53:38

I think the expectation is everything is going to be real time. And actually, Google’s Dataflow paper, I think, had a really good distinction between real time and batch, and what their distinction was, instead of thinking about it in terms of real time, or batch, or streaming, think about things as bounded or unbounded by time. I mean, we only do batch right now, because it’s an artificial distinction that we have to do because of technical limitations, right? So you time bound your data, but in essence, all data is actually unbounded. And so the closer you can get to sort of this organic feel with data and events just sort of happening as they happen, like, you know, the rest of the world operates, you know, like the actual, you know, world and universe is real time. It’s all event driven. Humans are the only ones that seem to batch things up. And it’s more just because it’s convenient, not because it’s actually how things work. Sure, you know, so, you know, what does the future of batch look like, in the real time world, I think that they’re actually synonymous because batch is actually a subset of real time. And when you take away the time bounded constraint, you know, it is the same thing. So it’s a sort of transitive property of time boundedness of data and events.
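The bounded-versus-unbounded framing can be sketched in a few lines. This is an illustration under assumptions, not code from the Dataflow paper: the event shape and the window helper are made up, and an endless generator stands in for an event stream. A batch is simply the unbounded stream restricted to a time range, and the same aggregation logic applies to both.

```python
from itertools import accumulate, count, islice


def events():
    """Unbounded source: yields one event per tick, forever."""
    for ts in count():
        yield {"ts": ts, "value": ts % 3}


def window(stream, start, end):
    """A batch is the unbounded stream limited to start <= ts < end."""
    for event in stream:
        if event["ts"] >= end:
            break
        if event["ts"] >= start:
            yield event


# Batch view: time-bound the data, then aggregate once.
batch = list(window(events(), start=0, end=5))
batch_total = sum(event["value"] for event in batch)

# Streaming view: keep a running total that updates as each event arrives.
running_totals = list(
    islice(accumulate(event["value"] for event in events()), 5)
)
```

After five events the running total equals the batch total, which is the sense in which batch is a subset of the unbounded case.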

Eric Dodds  54:52

Very, very elegant explanation there, Joe. That was wonderful. I think it’s great. Well, we are at time here. And this is a great show. Really good conversation. I learned a ton. And we’d love to check back in with you at some point in the future. And have you back on to hear about the new things you’re learning as the space unfolds. So gentlemen, thank you so much for joining us. Thanks.

Joe Reis  55:19

Thanks, Eric. Thanks Kostas.

Eric Dodds  55:22

Well, that was a really interesting chat. I think, beyond learning that there are multiple types of infinities, which is still bending my brain from a mathematics perspective, I really liked the way that Joe described real time data: the distinction between batch data and real time data is really just sort of a distinction that we use because it’s an easy way for us to digest the concept. But he said in reality, all data is real time. And as the technology catches up, we’ll see that concept play out more and more in companies. And I really appreciated that. I think he made that distinction for us in a really clear, concise and elegant way.

Kostas Pardalis  56:12

Yeah, absolutely. I mean, I think he had a very elegant way of explaining and describing this fact that, at the end, batching is just a convenient approximation to reality that we humans make because we’re constrained by the technologies that we have. But I think that this is going to change, and I think it’s changing already. We see more and more, let’s say, event driven, streaming based approaches to problems that traditionally were tackled by batching mechanisms. Outside of this, which of course I think was an amazing conclusion to our conversation with them, I found it extremely interesting, this whole journey of starting from science and going to data science, which is a pattern that I think, Eric, we have seen a lot on this show, and then as the next step for them going to data engineering, because they figured out that data engineering is the real problem that has to be solved before we figure out the most, let’s say, sexy and complicated cases of machine learning. At the end, if you don’t solve the problem of the quality of your data, for example, or the availability of your data, your model is completely useless. And yeah, I mean, I know that we are both aware of this fact. But it was great to hear that from these two gentlemen, especially because of the experience that they have and all the different companies that they have helped so far in building their data infrastructure.

Eric Dodds  57:43

Totally agree. Well, thank you again for joining us on The Data Stack Show. Be sure to subscribe on your favorite podcast network if you haven’t already. And that way you’ll get notified of new episodes every week and we will catch you next time.