Episode 49:

MLOps – The Finalization of the Data Stack with Ben Rogojan of Facebook

August 18, 2021

Kicking off season three of The Data Stack Show, Kostas and Eric converse with Ben Rogojan, perhaps better known as Seattle Data Guy. In addition to being a data engineer at Facebook, Ben is also a consultant for a number of smaller organizations and shares his insights on some of the unique issues data engineers face at companies with such a vast difference in scale.

Notes:

Ben’s background and his shift to data engineering (2:19)
Trends in the data space: finding the most efficient tools, the Snowflake phenomenon, and keeping up with new functionalities (5:33)
Key differences in data practices in small companies and Facebook-sized companies (12:38)
Having to build tools specifically designed for Facebook because of SaaS product limitations (16:00)
Team structure at Facebook (18:17)
Developing more robust systems that are resistant to pipeline failure (19:50)
Defining data stacks (24:01)
A sample data stack for a young company (28:37)
Why Redshift and Snowflake have trended in the opposite direction (33:02)
BigQuery and Snowflake comparisons (36:06)
MLOps and whose responsibility it is (39:12)
Feast, Tecton, and feature stores (45:40)
Having a good community around an open-source product (49:30)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 00:06

Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com. Welcome back to the show.

Eric Dodds 00:28

Today’s guest is Ben Rogojan, and he is known online as the Seattle Data Guy. Many of you may follow him on Twitter; he has lots of followers. And he has done a lot of work with data in his career. And he has a really interesting set of things that he’s doing now. So he works as a consultant. So he helps companies figure out their data stacks. And then he’s also a data engineer at Facebook. And so that leads me right into my big question. You know, I love asking consultants about what they’re seeing on the ground, just because they have a wide field of view. But I want to hear about the difference between what he sees on the ground as a consultant with smaller companies. And then what he’s dealing with at Facebook. Facebook is one of the FAANG companies. It’s so large that I think a lot of us have a hard time comprehending just even the types of problems that they deal with. So I just want to hear him talk about the differences there. And I think it’ll be really interesting to hear the difference between those two experiences.

Kostas Pardalis 01:26

Well, I don’t think I’m going to differentiate much from your questions, Eric, I’ll just maybe go a little bit deeper on the technical side of things. But yeah, I find it very, very interesting to see what the differences are between an organization like Facebook and the rest of the companies out there. And I think that Ben is like the perfect person to talk about that stuff with.

Eric Dodds 01:49

Great. Well, let’s jump in and chat with Ben.

Eric Dodds 01:53

Ben, thanks so much for joining us on The Data Stack Show. We’ve followed you on social for a while, love your content, have learned a lot from your content, and are really excited to have you as a guest on the show.

Ben Rogojan 02:04

Thank you. Thanks so much, Eric. I really appreciate you guys having me on today.

Eric Dodds 02:08

Cool. Well, let’s start where we always start and give us a little bit about your background and kind of what led you to what you’re doing today, and then talk about what you are doing today.

Ben Rogojan 02:19

Yeah, no, yeah. So just to give kind of a background of like where things started in my journey on data really started back in college before I even knew data engineering or even data science, in some regards, was a thing. And I took like an epidemiology course. And then I think it was also taking some computer science courses at the time. And I was kind of enthralled by how you could use data to like drive results. Like you learn about John Snow in epidemiology, I think it’s the first thing most people learn about is how he kind of used data to figure out cholera, and things of that nature. And I was like, man, if only there was a way you could combine statistics and programming. And then I think it took like three more months to figure out that that was a whole kind of rising field at the time, right? Like 2012 was the year of that whole Harvard article coming out about the sexiest job being data science. And so that’s where I originally started. I was like, really into data science. And then eventually, I think, as I started working, and kind of maturing and figuring out what I liked, I tended to like more the engineering side of things. So I started flowing more towards the data engineering, data architecture side of building things. And so then from there, it was like working at some healthcare companies and startups working on their kind of data flows and data stacks and data pipelines. I, at the same time had started a consulting company in data that was, again, initially started in data science, but kind of shifted over to more data engineering and data architecture, and then eventually shifted over into working in big tech at Facebook as a data engineer. So that’s kind of been the whole kind of quick flow for me from school to kind of career. And then again, now I’m kind of doing the Facebook thing, while also having a consulting company that I’ve been operating for a little bit at this point. 
And we’ve got a few clients where again, we kind of help them develop their whole data stack and whatever that again, we can discuss what that means here in a second. But helping them figure out what tools fit best for them, whether it be database, data storage, kind of going through different options, especially right now with so many tools that are really coming out and I think changing the game. Snowflake’s been out for a while, but I think the more and more I play with it, the more and more I see people use it. So things like that. Just other data warehousing tools, tools like RudderStack and things of that nature. So yeah, I think it’s kind of where I’m at right now where I’m really enjoying both of those fields, or both of those kinds of roles and getting a chance to see a lot of different perspectives and how people are using data and trying to use data.

Eric Dodds 04:34

Awesome. It’s interesting starting out in the healthcare space. So there are lots of data concerns there that are the sharp end of dealing with some of the issues around governance and compliance. And so I’m sure that was just a really good entree into dealing with some of the more difficult challenges around data. I’d love to dig into both sides of your work that you’re doing as a consultant and then at Facebook. We’ve talked with several consultants on the show before and we love hearing about what they’re seeing on the ground. As a consultant, you get a breadth of view across many companies who are trying to solve problems that are similar or at least contiguous in the data space. So why don’t we start with what you are seeing on the ground? What are trends? You mentioned that more and more companies are using Snowflake, but even outside of tools, are there architectures, team structures, problems, or solutions that are really interesting that have popped up over the last couple months?

Ben Rogojan 05:33

Yeah, I think one of the big things that I’m seeing is people trying to readjust or trying to figure out how to do more with less. It’s becoming, I think, for most companies just apparent that data is growing at a rate that if you were to continually hire data engineers at the rate required based off the current solutions that you might be using, that it will become just vastly expensive to pay enough data engineers to manage all of the various data sources that you have, right? Like, I mean, small companies these days that are in the, you know, eight-figure, seven-figure range, will have like 20 to 50 data sources possible. And hiring one data engineer for that, it’s gonna be 150 to 200k, depending on where you are, is barely feasible, and having to hire more is not. So I think a lot of companies are looking to different solutions in that regard, whether whatever place that might be, again, Fivetran is a popular one or a different one, depending on where in that data flow you’re kind of looking to fill in that role. But I think that’s one thing people are trying to figure out: how to do more with less. In a weird way, I think some people feel like data engineering is gonna go away in that regard. But in my aspect is just like, you’re just gonna have to … data engineers are just going to be more capable, kind of like how software engineers have been amplified over the last couple years with cloud becoming more efficient in terms of how you can actually deploy code and things of that nature much easier rather than having to get a server and have like five people just to stand up a little bit of code. Now you can have one or two really solid engineers kind of manage a whole flow. 
So I think that’s kind of where things are shifting in data: we’re trying to figure out how do we have one really solid data engineer manage a lot more with tools and the right solutions, rather than spending tons of time putting together patchwork systems built up with like cron, and PowerShell, and scripting and bash and things of that nature. So I think that’s one trend, I think I’m seeing.

Ben Rogojan 07:19

I think Snowflake’s been an interesting trend. Because I think at this point, I’ve had like five or six combinations of clients and proposals that I’ve written all around that. So that’s been like, interesting, in terms of more like tools that people are selecting. I think it’s just one of those interesting things. Because Redshift, I think, has kind of started this off. And I think then BigQuery and Snowflake have kind of had this like quiet popularity, where either they’re doing better marketing, or I don’t know what it is about those two that tend to just get decent traction at least in people’s minds. So I’m still trying to figure out that whole thing more from the perspective of someone constantly coming into these projects and seeing those same two things, rather than seeing something like Redshift, or like Microsoft. Microsoft products have kind of been a little newer. So that could also be a reason why Microsoft’s a little behind. But, yeah, why those two seem to be doing kind of well, overall, is something else that I’m kind of thinking about, and trying to figure out how to work with as a consultant.

Eric Dodds 08:12

Yeah, it’s super interesting. Before the show, we chatted a little bit about ML and so I want to dig into that subject a little bit later in the show. But we’re seeing a lot of companies leverage BigQuery for some of the native ML functionality, which is super interesting. It kind of in some ways allows teams to punch above their weight class when doing certain things. But one final question on what you’re seeing. And I agree, it’s super interesting. The idea of data engineers going away is a fascinating conversation in and of itself. There’s no way that’s going to happen, of course, in my opinion, but I’d love your opinion on what I will call the gap between companies who are actively figuring out the frontier of data using these new tools and processes. And then a lot of companies there are a lot of companies who just don’t even know that some of this stuff exists. And it seems like that gap is widening. And the companies that aren’t adapting, it seems like they will really, I mean, depending on the business model, they’ll really struggle just because operationalizing data is really the way that, you know, we shaped the experiences that modern consumers have come to expect. Do you see that happening? Where there’s a lot of companies that are like, man, I’ve never even known you could connect all this stuff.

Ben Rogojan 09:26

But yeah, I know, I do kind of get that feeling where it’s like, and I think part of this is due to the fact that there’s kind of information overload, right. Like, it’s like, what is the right product? So I think sometimes people might be focused on older products that have existed forever. I mean, they’re still kind of looking at that as their solution. Or maybe they, you know, whatever, they might have been viewing the products that they’ve known forever, and seeing the limitations that they had maybe 10 years ago and thinking that they still have those limitations. And I think that’s kind of like maybe what’s holding people back like I think even with Tableau I remember there was a point where I didn’t realize … I don’t remember what feature they’d recently added for a while there, I didn’t even realize they had added … but there was one that was pretty pivotal in terms of whether you were to pick Tableau as a good data viz solution or not. And yeah, if you’re not staying on top of all of the stuff that’s changing in every space, that is what can really keep people behind. I did a video on a data engineering roadmap. And I kind of had a joke within the first five seconds or maybe it’s like 15 seconds, where I picked a picture or an infographic of like, the data space, tool-wise, right? Like one of those VC kind of infographics of like, all the tools broken down …

Eric Dodds 10:37

Yeah the ones that like all the VC firms have been like making their architectures and visualizing that.

Ben Rogojan 10:41

… yeah, but like, you literally couldn’t tell what you were looking at. There were so many. And I think that’s currently one of the major problems. Data ingestion alone has like 50 tools you could pick from, right. And they range all from like tools that have been around since, like, 1995, and tools that came out just last year. And so like, I think that alone can make it very difficult in terms of like, knowing what exists, and knowing what each of these things do and why you might want to use one or the other. So I think that’s one of the big things. It’s just so hard to keep up.

Ben Rogojan 11:13

And then the other thing is, like, I think some companies are … I think all the big, a lot of the bigger companies I think are catching up, I think some of the more mid and small companies that are finally like, that’s where the big gap is, some of them are like either reading all the Medium articles or whatever they’re doing, and then they’ll contact me and be like, we want machine learning and they’re too far ahead. Or something like that. And then there’s some people who are just so busy and they’re probably day to day in their ops that they just don’t feel like they have the time to keep up. Or maybe they don’t feel like it could help them. So I think that’s another hard place for some people. They don’t realize that maybe there’s something that could help them. So yeah, it’s probably one of the gaps.

Eric Dodds 11:49

Yeah, super interesting. Okay, let’s change gears a little bit. So that’s what you’re seeing on the ground on the consulting side of things. But you also work as a data engineer at Facebook. And what really intrigues me about that is you get to work with companies as a consultant, who are just orders of magnitude smaller than Facebook, and then you also get to see, okay, what does this look like at scale? And I know, we could probably do a whole episode on that. But I’d love for especially the people in our audience who are maybe working at, you know, a company in the mid market, or even a small enterprise company to hear what are the unique things that you’ve experienced as a data engineer at a company that’s the size of Facebook, a true international enterprise with a massive engineering team?

Ben Rogojan 12:38

Yeah, I mean, I think that alone, you just kind of stated one of the major differences right there is that you’re talking even like, like you said, like a mid-sized company, or even some large Fortune 500 companies just, they have an engineering staff of you know, 2,000 people maybe at a larger Fortune 500. And if you’re small, mid, or something like that, maybe it’s a few hundred people that are part of your engineering staff, and then you’re trying to compete with companies like Google, Facebook, Amazon that are gonna have engineering staffs of 15-, 20-, 30,000 engineers, right, and just 5,000 of them might be focused on more of the enterprise side and developing enterprise systems and architectures and like just services in general that make your life so much easier if you work at those companies versus again, small and mid cap companies, you’re likely either relying on pre-bought products, or maybe you’re trying to put out your own internal solutions. But obviously, it’s just hard to commit at the same time, the same amount of time towards this. And then you add in the fact that many of those mid, small, and even larger Fortune 500 companies have been around so long, they’re still relying on like source systems and operational systems that are maybe super functional and developed amazingly, but maybe their analytic systems are just antiquated or developed in such a way that it was developed for 20 years ago data sets or something of that nature. So then you’re also having to re-migrate and do all this other stuff, just to get yourself to a point where you can actually build systems that act like a larger company like Facebook, Amazon, and so on. And so I think that’s been the big difference I’ve noticed. And I’ve talked to people like, I don’t know if you know, Veronica Zhai from Fivetran. She kind of comes from finance and banking at JPMorgan. 
And she’s kind of brought that up as well, where their op-side is amazing, but their analytics side was pretty terrible when she first started. Yeah. And she spearheaded that whole kind of development on developing their whole ETL and whatever. And that’s what eventually brought her over to Fivetran was like dealing with all this terribleness. And then seeing Fivetran as a possible solution for herself. So that’s like one kind of area that I think companies are gonna continue to deal with, right? Like you’ve dealt with these systems and they work well, but they’re old or they’re not as easy to work with because they were developed for a different thing. Yeah, I think that’s one of the key differences.

Eric Dodds 14:53

Yeah, that’s interesting. And I want to just touch base on one thing you said that is pretty mind-blowing. You said, maybe you have, I mean, I know you’re just spitballing here, but 5,000 engineers who are focused on building tools that make the engineering team’s life better. And it’s just crazy. You know, there’s companies that IPO with far less total employees than the number that are working on a very specific set of things inside of Facebook. And that scale is just mind blowing. I’d love to know, I mean, just out of personal curiosity, but I think for our audience as well, at that scale, you’re probably building a lot of things internally, because you outstripped the ability of even what we would say like enterprise-grade SaaS products can manage at scale. And I know that different teams for different use cases will probably use various SaaS products. But I would be surprised if you didn’t have a lot of homegrown solutions just because there isn’t SaaS that’s built to manage what you’re facing, because not a lot of companies have faced that before.

Ben Rogojan 16:00

Yeah, and I think it’s something that when you join Facebook, they kind of bring up just like you think about when Facebook came about, you think about Amazon and AWS, and when it was kind of developing its things and there was no AWS or not to the same degree, at least, when Facebook was dealing with their problems, right. They had to develop all of their own solutions, essentially. So yeah, I mean, like, I think that that’s the thing, like, there weren’t even options to some degree. And so a lot of companies that especially deal with that size of data have to develop their own. Again, whether it’s Google, Facebook, Amazon, I think that’s right, like that’s why Amazon developed their products. Originally, it was for themselves and then originally, or then eventually realizing they could sell them and develop their own cloud service. But yeah, it’s a combination of both like, yeah, Facebook probably can build their own or at least better integrate. I think it’s another thing, right? Like, regardless of the SaaS product, there’s only like, there’s usually limitations to integrations, regardless of how well they’re often developed. It’s just hard. And you’re always going to be limited by the SaaS provider, right? Like, if you’re working with Salesforce, you’re only going to be able to do so much like it’s pretty customizable. But there might be a point where you’re like, Oh, I just wish I could do this one thing. And there’s no engineer, you can go out and be like, Hey, can you do this thing for me? But if you build it yourself, you’ve got a whole team and you can be like, Hey, can we get this feature going? And at least that’s a little more feasible, obviously, then you’ve got other problems with internal people making choices on what features to work on. But at least there’s a little more control where you can go to that team and be like, Hey, could we get this feature? We think it would really change the workflow. So I think that’s another reason. 
It’s not just about scale, it’s also about having that ability to integrate at a very different level than most other companies.

Kostas Pardalis 17:38

This is great. Ben, can you give us a little bit more information about the teams of the engineers and what that work looks like in a company the size of Facebook, especially for data engineers?

Ben Rogojan 17:50

I guess I’m curious on what specifically you’re looking for.

Kostas Pardalis 17:53

Yeah, it’s more of an organizational question, to be honest. I mean, we are used to thinking of data engineering teams to be like small teams relative to the product engineering teams that we usually have right. And of course, they don’t reach the scale of what Facebook has. How does the scale affect how the team is structured? That’s the essence of the question.

Ben Rogojan 18:16

Yeah, I’m gonna guess it’s very similar to plenty of other companies where it’s like, oftentimes, I think Facebook tries to support a product with a team of data engineers, right. Like that way you’ve got good integration with both sides, both the software side and the analysts and data scientist side. So depending how big the product is, could change how big the team is. But overall, you’re trying to support some product with some data engineering team so that way, you can kind of be one pipeline where it’s like you’ve got to have a good relationship with the software engineers, they have a good relationship with their XFNs on the other side, and everything runs smoothly. I mean, I think there’s always gonna be a problem with and this is something I see, regardless of whether you work at Facebook, with Facebook or other companies, is software engineers, I think always tend to be focused on functionality in terms of we want to make sure it works and care less about like data. I mean, obviously, they need data in terms of like, making sure their product is up to date, right? Like if someone clicks or posts something they want to make sure that that information gets stored, but logging and things of that nature is kind of secondary, right? Like if the product works, do you need to log things? So I guess that’s generally the one interesting thing that I’ll often see.

Kostas Pardalis 19:22

Yeah, that’s super interesting. Do you also, I mean, we have in our minds that data engineers are mainly building and maintaining data pipelines. Right. What else do you see getting done by the data engineers? Like, do you see them building like internal tooling for example? Is this something that you see? And if yes, how is this managed? Like, how much of the work of a data engineer in the future you think is going to be something like that?

Ben Rogojan 19:50

I think there’s definitely always going to be kind of a need to build internal tooling to kind of abstract as much as you can away in terms of building data pipelines. Right. It’s a balance between abstraction and building maintainable systems. But I think that’s always kind of a goal in general for data engineers, not just to build pipelines because they can build that with some Python scripts, but also figuring out how to build more pipelines more effectively in terms of variables, right? Like, if you have to manage 1,000 pipelines, how can you manage 1,000 pipelines easily, because that’s, it doesn’t take much for a pipeline to fail. Like, I think that’s the one thing I found interesting about, regardless of the company that I’ve worked at, is it doesn’t take much for most pipelines to fail, it could be one column changes, one data type changes, and regardless of how much you maybe make some component inside that whole pipeline, maybe a little more robust, so it doesn’t get impacted, there’s always somewhere downstream that maybe does get impacted, maybe a table in MySQL, or something of that nature, or in your data warehouse. So I think, yeah, I think trying to figure out how to develop systems that are more robust in that sense, or at least can make it easier to manage when things do go wrong, or provide better notifications, whatever it might be. I think it’s kind of a role of a data engineering suite. We tend to know what we’ll need. Again, I think it does depend on the company you work at as well, I think if you work at a large company like Facebook, you’ve got, again, tons of software engineers that are probably building a lot of those products, you work at smaller companies elsewhere, even things like Lyft, I’ve talked to people, you’re tending to play a little more of a software engineering role, and not purely focused on just like data pipeline. 
So yeah, I think it also just depends on the company and how much support you have from maybe software teams that maybe purely develop that kind of like data instrumentation or something of that nature.
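To make the "one column changes, one data type changes" failure mode concrete, here is a minimal sketch of a schema drift check that could run before a load step and feed a notification. It is an illustration only, not something described on the show: the table columns, `EXPECTED_SCHEMA`, `infer_schema`, and `find_schema_drift` are all hypothetical names, and the types are inferred from a single sample row, which a real check would not rely on.

```python
from typing import Dict, List

# Hypothetical expected schema for an illustrative "events" table.
# Column names and types are assumptions made up for this sketch.
EXPECTED_SCHEMA: Dict[str, str] = {
    "user_id": "int",
    "event_name": "str",
    "event_ts": "str",
}


def infer_schema(row: dict) -> Dict[str, str]:
    """Map each column in a sample row to the name of its Python type."""
    return {column: type(value).__name__ for column, value in row.items()}


def find_schema_drift(expected: Dict[str, str], observed: Dict[str, str]) -> List[str]:
    """Return human-readable problems; an empty list means no drift was found."""
    problems: List[str] = []
    for column, expected_type in expected.items():
        if column not in observed:
            problems.append(f"missing column: {column}")
        elif observed[column] != expected_type:
            problems.append(
                f"type changed: {column} expected {expected_type}, got {observed[column]}"
            )
    for column in observed:
        if column not in expected:
            problems.append(f"unexpected column: {column}")
    return problems


if __name__ == "__main__":
    # An upstream change turned user_id into a string, dropped event_ts,
    # and added a new "source" column.
    sample_row = {"user_id": "42", "event_name": "click", "source": "web"}
    for problem in find_schema_drift(EXPECTED_SCHEMA, infer_schema(sample_row)):
        print(problem)
```

Failing early with a list of named problems does not make the pipeline more robust on its own, but it does make it easier to see what went wrong before a downstream table is affected.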

Kostas Pardalis 21:33

Yeah, yeah. I think that that was, by the way, like an excellent point, what you said about pipelines failing. And that’s like, regardless of the size of the company. And I think that’s also like, a space where there are many opportunities for products also like to probably be created exactly. Because like the concept of like, observability, let’s say that we used to have, in a typical, like software product, it’s not so well defined when it comes to data pipelines and data in general, and probably need some kind of like different approach. So it’ll be super interesting to learn more about how you manage this problem. And like, what kind of lessons you have learned from trying to do that. But before we go deeper into this, a couple of months ago, we had another episode with someone who actually came from Facebook and his name is Ivan. And he left to start a company called Slapdash. Actually he took what he learned inside Facebook and the problems that he had to solve there. And in a way productize it right, like he came up with a product. I don’t want you to tell me like, exactly, but do you have the feeling that we might see something similar coming out from Facebook also for data-related products?

Ben Rogojan 22:52

Yeah, I mean, obviously, there’s multiple reasons. I probably can’t speak to that. I think overall, the answer is I have no idea. Right? Like I already said, like I kind of said earlier, right? Like Amazon, built all this stuff internally, and then started selling it. Facebook, in the sense, has built a lot of similar things, but has never sold it. Why, I’m unaware of. It’s not part of my purview. Also, even if I was, I imagine that would be something I would have to double check with someone before I would say anything.

Kostas Pardalis 23:21

Yeah, of course. Makes sense. Okay, cool. So let’s get a little bit more technical. And let’s discuss a little bit about data stacks. And based on your experience, because you have experienced, like both extremes, probably through your consulting career, but also like working in a huge organization like Facebook, you have seen many different, let’s say, versions of what we keep calling data stacks. So based on your experience, first of all, what is the data stack? Like what parts of the software that the company is using to operate should be called the data stack of the company?

Ben Rogojan 24:01

Sure. I mean, like, I think it’s, again, like I said, it’s pretty broad. If you’re talking purely about the analytics data stack, I think you start with raw data, and you go all the way to like data storage, data viz, and maybe some light data analytics. I mean, if you want to include some ML stuff in there, you could if you really want to go that far, I think it’s definitely like that tip of the iceberg kind of data stack stuff. So that’s why I’m not as focused on that when I refer to data stack. Yeah, I really focus on that raw data, like data ingestion, data storage, data transformation, and then some sort of data viz or whatever your final data product could be, because I think there’s some data viz. I also think that data products don’t always have to be a dashboard. I think there’s plenty of examples of like, other forms of reporting that you could consider kind of your final product. I think one of the things I’ve been recently trying to work on is building something like NerdWallet has this calculator that is a cost of living calculator, and they basically scrape a bunch of information from different sources. Put it all together, and then you can now put in I want to move to LA, I’m currently making 150k, and it’ll calculate how much you should make. And it kind of gives you some other information about like, how pleasant it is to live there based on like some walking scores and other information you can pull from APIs and then like cost of rentals and other things that they’ve kind of pulled and aggregated together. So I think that’s also kind of like less of a data viz and more of like, a data product side where I kind of put that in the data stack as well. Because like, it’s part of the whole flow. So yeah, so that’s kind of, I think, the steps that most people will reference and what I kind of consider it, right, in terms of like, what is the data stack?
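As a rough illustration of the final step of a data product like the cost of living calculator mentioned above, the sketch below scales a salary by the ratio of two cost indices. This is an assumed, simplified formula, not NerdWallet's actual method, and the index values, the starting city, and the name `equivalent_salary` are invented for the example.

```python
from typing import Dict

# Made-up cost-of-living index values, purely for illustration.
# A real product would aggregate these from scraped and API-sourced data.
COST_INDEX: Dict[str, float] = {
    "Seattle": 150.0,
    "Los Angeles": 165.0,
}


def equivalent_salary(
    salary: float, from_city: str, to_city: str, cost_index: Dict[str, float]
) -> float:
    """Scale a salary by the ratio of destination cost index to origin cost index."""
    return salary * cost_index[to_city] / cost_index[from_city]


if __name__ == "__main__":
    needed = equivalent_salary(150_000, "Seattle", "Los Angeles", COST_INDEX)
    print(f"Equivalent salary in Los Angeles: {needed:,.0f}")
```

The calculation itself is trivial; in a product like this, most of the work sits upstream in the ingestion and aggregation steps that produce the index.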

Ben Rogojan 25:33

And it’s so broad in terms of like, even starting all the way to raw data, like what does it mean? It’s like, well, that raw data can come from everywhere, right? Event logging, SFTPs from other companies, and like external files from other companies, scraping things from online, pulling government data from online, pulling from your various APIs and marketing tools, and Salesforce, and things of that nature, streaming data that maybe you’re getting in. So it’s just so broad that like even that alone, it’s like that’s where it all starts in terms of complexity. And like probably the hardest part and why so many companies, I think right now are focused on the data ingestion layer, because it’s like if you can do well and develop a good product for data ingestion, you’ll do okay, in terms of like a product.

Kostas Pardalis 26:16

Yeah, it makes a lot of sense. Did you see something changing on like the broad definition of data stack, like based on the size of the company?

Ben Rogojan 26:25

I think in a weird way, it’s like a lot of companies are getting access to a lot of tools that they never had before. Like, I don’t think the term data stack was almost used the same for smaller companies up until now, at least what I feel, right. Like a lot of companies, up until now, you had your 30 Python scripts that you ran or shell scripts or whatever you prefer to script in that you managed in cron. And it worked fine, because you only had five data sources. Now that companies like all of their products are SaaS, all of them have API’s, or at least a good portion of them have API’s. Being able to switch over to some more well-defined components, I think is what’s personally I’m seeing more of a switch. We can actually switch over to something just so I don’t say the same product over like, like, like rivery.io, or Airbyte, I think those are two other kind of tools that are looking to fit into the data connector, data ingestion layer, you can use those instead of having to, again, develop a bunch of custom scripts. Because again, I’ve created four Salesforce connectors in my life already. Right. So yeah, it’s like we’ve all created the same things over and over again. And it makes a lot of sense that someone tries to sell that and productionize it. So yeah.

Kostas Pardalis 27:40

Yeah. So if you were to advise, let’s say, a young startup that has their first customers, they create, like some revenue, they have to do some reporting, they have to understand a little bit better, like how their customers interact with their product. What would be, let’s say, an ideal data stack that would make sense for a young company? And the reason I’m asking that is because many times we tend to over-engineer solutions. It’s not like you need to operate a Kafka cluster just to move your data around, right? And it’s like a very common mistake that people make. And it costs a lot, both in technical debt and in time, and at the end they end up with the results that are pretty much noise, right? So can you give some advice, how you would structure let’s say, the data stack again, for a young company?

Ben Rogojan 28:37

Sure, I mean, I think raw data, it’s, again, hard to say like, how are you going to pull that off? Just because it depends how you’ve developed or like, what tools you use. But yeah, beyond that, right, like, I think tools like RudderStack or Segment can work well in terms of trying to log and just getting a lot of that information out there initially, and getting it to the right place. I usually switch between Fivetran and Airflow depending on maybe what a company’s technical knowledge as well as maybe price sensitivity. I think, for example, Fivetran can end up being very expensive. But if you can afford it, or if it really does help you because you’ve got enough data sources, you’re trying to manage it. That can be very helpful. But I also think Airflow is kind of great. Overall, decently simple, and you can automate pretty well. I think the one hard thing is some people have a hard time managing Airflow because it does tend to be a little bit finicky for some people, but that tends to be more of the coding side of what I’ll use, rather than trying to develop my own thing. In terms of, maybe data storage, I think at this point, I will probably switch between BigQuery and Snowflake. I think those are kind of two favorites or if they’re not like using tons and tons of data, like it’s really small, I’ll even use Postgres just because it’s like it’s small. If it’s okay, look, this is fine. We don’t need to do anything crazy or go pay huge costs for crazy optimizations. But if you have a lot of big data, Snowflake and BigQuery are great. Also I feel like Snowflake is the Apple of data warehouses. It just has this feel to it. Like I don’t know why I like using my Mac or my Apple in terms of my laptop. I just do. I don’t know why I like using Snowflake compared to some of its counterparts. I just do. It’s easier. I don’t know why. Yeah. I don’t know what it is. I just like it better. Okay. I don’t know why, maybe it’s just the branding. I don’t know. 
They’ve got something there. And then data viz, I think I still generally, like as much as I think Looker is kind of what is often named as like the modern data stack tool, I think I still just prefer Tableau’s usability. It’s just so much easier to build anything very quickly. And if you know what you’re doing, I think it’s fine. I think Looker, the one thing is right, like it has LookML, not machine learning, but it has its models and things of that nature that you can kind of define things a little more, which some people prefer, and I get that but I think if you’re safe with Tableau, I think Tableau has just got easier usability and you can build up something so quickly. And that’s generally what I still prefer in terms of data viz.

Kostas Pardalis 31:11

Yeah, it makes sense. Makes sense. I actually found it very interesting that you mentioned Postgres. Especially when I talk with young companies, like the first thing that I asked them when they’re trying to figure out their data stack, let’s say, is okay, are you like a B2B or a B2C company? And that makes a huge difference in terms of at what stage of the company data, especially the volume of data becomes an issue, like a B2B company, I mean, you can pretty much grow a lot and still just use Excel documents, in some cases, especially if you’re like focusing on large enterprises and stuff like that, which is, of course, completely different compared to building a marketplace or building an app, DoorDash, even at the early stages of DoorDash, like, the amounts of data generated might be like, huge. And it’s a very common advice that they also give, like, just use Postgres. I mean, Postgres can scale to quite a lot of data without having to go and get yourself into using something like Snowflake, which of course it has, you put it very well, I think that this parallelism with Apple is amazing. I love it, it feels nice, but at the end, it’s like, I mean, come on, dude, you don’t need that to answer like a few small queries, right? Yeah. Like just use Postgres, as you would also do for your, for your products. So that’s super interesting. You mentioned Snowflake and BigQuery. Right. There’s also Redshift. And it’s a very interesting story. Because Redshift is the first product in the cloud data warehouse space. Right. But we tend to not talk that much about them today. In your opinion, what do you think went wrong with Redshift? And what do you think Snowflake did really, really well?

Ben Rogojan 33:02

Yeah I mean, like, this is my personal opinion, I think Redshift is just, it’s not that different from most data warehouses, but it’s almost too different. I feel like it’s not like in my own mind, like, there’s so many nuances on how it works and how, like, you have to be just that much more technical in using it and making sure you’re like using it properly, then I think, maybe some of the previous tools, or at least, like people were technical using like Oracle and MySQL Server in terms of like building their data warehouses back whenever but like, I don’t know, it just felt like such a shift in how you thought and how you design right, like you couldn’t run updates, right? Like, that was like a weird thing. I think they might have recently added that. But there were little things that like, classic data warehouse modeling wouldn’t necessarily work well. And I think that kind of took it’s like, people were like, okay, so if I want to run like an insert merge, and do slowly changing dimensions, oh, I can’t, or I gotta like, do this weird thing and add, like, do two tables kind of thing, like have a staging table, have the current table and then create a new table based off of that. And so I think that might have been a little bit clunky. I think that’s the biggest thing. I think it’s clunky. And again, going back to Snowflake, Snowflake just operates. It is how you think it should work. And I think that that’s what makes it different. It’s the same way like why do people prefer Macs over Windows? Like Windows can sometimes be clunky. I don’t know what it is about it. It just feels a little clunkier than it would, you know, with Apple. I’m not a designer, so even for me, the reasoning eludes me. It’s just like, that’s usually my descriptions like, well, this one feels clunky. That one feels smooth. That’s what I can tell you. I like using one, I don’t like using the other.
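
To illustrate the staging-table workaround Ben describes here (a current table, a staging table, and a new table built from both), the following is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for a warehouse, so it is not Redshift syntax. The table and column names (customer_dim, customer_staging, city) are hypothetical. It simply overwrites changed rows rather than keeping history, which is the simplest form of a slowly changing dimension.

```python
import sqlite3

def merge_via_staging(conn):
    """Rebuild the dimension table from current + staging rows without UPDATE."""
    cur = conn.cursor()
    # Keep current rows that staging does not supersede, then append all staging rows.
    cur.execute("""
        CREATE TABLE customer_dim_new AS
        SELECT c.customer_id, c.city
        FROM customer_dim AS c
        WHERE c.customer_id NOT IN (SELECT customer_id FROM customer_staging)
        UNION ALL
        SELECT s.customer_id, s.city
        FROM customer_staging AS s
    """)
    # Swap the rebuilt table in for the old one.
    cur.execute("DROP TABLE customer_dim")
    cur.execute("ALTER TABLE customer_dim_new RENAME TO customer_dim")
    conn.commit()

def demo():
    # Hypothetical sample data: customer 2 moved, customer 3 is new.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer_dim (customer_id INTEGER, city TEXT)")
    conn.execute("CREATE TABLE customer_staging (customer_id INTEGER, city TEXT)")
    conn.executemany("INSERT INTO customer_dim VALUES (?, ?)",
                     [(1, "Seattle"), (2, "Austin")])
    conn.executemany("INSERT INTO customer_staging VALUES (?, ?)",
                     [(2, "Los Angeles"), (3, "Denver")])
    merge_via_staging(conn)
    return conn.execute(
        "SELECT customer_id, city FROM customer_dim ORDER BY customer_id"
    ).fetchall()
```

Calling demo() returns the merged rows: customer 1 unchanged, customer 2 with the staged city, and customer 3 added. A warehouse with a native merge or upsert statement would make the rebuild-and-swap step unnecessary, which is the clunkiness being contrasted here.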

Kostas Pardalis 34:39

Yeah. Yeah, I totally agree. I think that’s like the feeling that you get from Snowflake is that it just works, right? Yeah. And if you have to scale up or scale down, again, it just works. I remember having to deal with Redshift. I don’t know how it is today, because again, I think they have changed quite a few things. Since the product has matured a lot. It has some kind of parity with the rest of the data warehouses out there. But having to rescale your cluster was a nightmare. You had downtime, for example, right, or having to vacuum your data, which is, okay, it’s kind of like a relic because they built the distributed system on top of Postgres. Postgres has this concept of vacuuming. And then they also introduced stuff like deep copying, and then you have vacuuming. It was a lot of, let’s say, not unnecessary work, but it could be very inconvenient when it shouldn’t be inconvenient. And that’s something that from a product perspective, I think Snowflake did really, really well. And that’s, that’s amazing. But on the other hand, BigQuery is not that different, right? When I first tried BigQuery, it had pretty much the same feeling. It just works, right? Yeah. Why do you think that Snowflake is much more successful in that way, or we’ll hear more about it than BigQuery.

Ben Rogojan 36:06

I don’t know if their marketing is better. I think maybe that’s possibly one side of it. Like I think their branding in general has been, like I recall, back in like, it must have been like 2015, or something. I went to a meetup, assuming it was some sort of like, tech talk, and about like, 30 minutes in, I realized it was like a sale, like basically just a sales guy trying to sell Snowflake. But even that was like all the interesting things they were talking about, like the design and how they made different things of that nature. So I think they’ve just been building on it for so long, that I think that’s kind of helped. I just think it’s been a lot more of a branding thing. And I think it’s easier to brand than it is to brand BigQuery, which is connected to Google Cloud. And so it’s hard to maybe separate from that, whereas with Snowflake, it’s like it’s just Snowflake, there’s nothing else connected to it.

Eric Dodds 36:56

I agree with that. I was gonna say if you think about Redshift, it had the first mover advantage, right. And so the product itself, in terms of being a new solution in the market, was sort of groundbreaking, just because of the nature of the product. And, and having the first mover advantage. Google is such a huge beast. And so I wonder, and this is complete conjecture, but you kind of have a strange feeling about using your free Gmail account, and then dumping all of your customer data into BigQuery. And it’s the same company, that’s just a little bit of a weird perception, like you said, where it’s like, well, I mean, BigQuery is an awesome tool. But from a branding standpoint, I think it’s really hard to overcome offering, like lots of free consumer products, and then building a brand around corporate enterprise trust when it comes to your most valuable asset, whereas Snowflake, that was all they did. And so it was a much more straightforward branding exercise.

Ben Rogojan 38:00

Yeah, I’m not a marketer, but I assume. I mean, like, again, I obviously, you’ve noticed, I do play some marketing. But I think even then, it’s like, just, for me, it’s sheer force of marketing, like, I’ll just keep putting out content, but I don’t exactly have a marketing strategy.

Eric Dodds 38:17

Yeah, well, if there’s anyone out there in the audience who has an informed opinion on this, we would love to have you on the show to discuss it, because we love talking about the Battle of the Warehouses. Ben, one other thing that we had chatted about before the show was what we call it, I loved how you said it, the other side of the stack. And I love that visual, right? Like, there’s a lot of the core components of the stack. And however you want to architect your data stack. There’s this other side of the stack that doesn’t get discussed a lot, I think for a number of reasons. But it’s machine learning and the ecosystem and process and tooling around machine learning. And that’s been of interest to you lately. Can you tell us just start off by saying like, what is it when you think about machine learning and MLOps. What are the types of things you’re talking about? And why don’t you think that they’re at the forefront of the conversation with the data stack?

Ben Rogojan 39:12

Sure. So yeah, when I referred to MLOps, it’s almost like everything but the model. I mean, in some ways it is the model, obviously it plays a role. But there’s so much other stuff that encompasses getting some form of model out into production. And I think this is such a constant problem for anyone who’s ever built a model that you realize you understand how to build a model and maybe that’s what you learned in school, maybe that’s what you learned in a boot camp, but you never learned the other side, which was okay, but now how does it go into production? How does it maintain? How do we make sure it’s still operating correctly over time? How do we deal with various problems that you can deal with in terms of keeping models up to date? What if problems start occurring? What if data drift starts occurring? Things of that nature become kind of challenging. So I think I think that’s kind of what I refer to as MLOps, which is like, all of the stuff around the ML model, which is kind of to me very similar to data engineering, which is like you have the data pipeline, but then around the data pipeline, you have so much infrastructure, that’s just there to make sure that that data pipeline operates smoothly, and you get notifications when things go wrong. And I think that’s a space that I think is going to continue to grow over the next few years, just because we’ve had now a decade or so of big tech companies and other you know, tech companies developing their pipelines developing best practices, like you said earlier, we have enough people with entrepreneurial drive that have worked at those companies that will now probably take those learnings and develop products to send out to other companies in the next couple years. So that’s one reason or one area that I’m kind of focusing on in terms of MLOps, in terms of why it doesn’t necessarily get discussed is, I think the term itself is still kind of coming into its own. 
I think it’s like only really, like there was like a paper, I think in 2015, that kind of kicked off the idea. But I think if I recall, the term started getting used like 2018, 2019, in terms of MLOps. And I think that’s one reason I think people are trying to just now get attention to this idea of like, okay, in order to actually get that model out into production, we need to have a system. We can’t just push it out there and not think about how it lives out there in the wild. It has to have something more. And so I think that I think it’s just more as companies mature in their data, understanding and data or like infrastructure, they’ll eventually get to that point, but I think a lot of companies aren’t there yet. Right? Like, they’re still trying to gather their data and manage it in such a way that makes sense. And so the next step after that would be like, okay, now that we have it all, we’ve done dashboards, we’ve gotten all the value out of this “low hanging fruit”, how do we really drive that next level? And that will be where they’ll learn about, okay, I want to develop this machine learning model. Okay, wait, I’ve developed it, now what? And so I think that’s generally what most new people come up to is like, okay, now what?
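
As one concrete example of the "now what?" work around a deployed model, the data drift mentioned above can be checked by comparing a feature's distribution at training time against what the model sees in production. The sketch below computes a population stability index (PSI) in plain Python. The function name, the bin count, and the small floor used for empty bins are all assumptions made for illustration, and the threshold at which a PSI value counts as drift is a rule of thumb rather than a standard.

```python
import math

def population_stability_index(baseline, current, bins=4):
    """Compare two samples of one numeric feature; larger values mean more drift."""
    lo, hi = min(baseline), max(baseline)
    # Equal-width bins over the baseline range; fall back to 1.0 if all values match.
    width = (hi - lo) / bins or 1.0

    def bucket_shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = max(0, min(bins - 1, idx))  # clamp out-of-range values to edge bins
            counts[idx] += 1
        # A small floor avoids log(0) when a bin is empty (an assumption of this sketch).
        return [max(c / len(values), 1e-6) for c in counts]

    base_shares = bucket_shares(baseline)
    curr_shares = bucket_shares(current)
    return sum((c - b) * math.log(c / b) for b, c in zip(base_shares, curr_shares))
```

A monitoring job could run this on a schedule for each important feature and send a notification when the value crosses whatever threshold the team has agreed on, which mirrors the alerting infrastructure that surrounds a data pipeline.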

Kostas Pardalis 41:48

Ben, quick question about MLOps, the word comes from ML and Ops. Okay. And the reason that I’m saying that is because usually with ML we are associating data scientists and engineers, right? How do you see the overlap between data engineers and like these other engineering roles, participating in MLOps? Whose problem is MLOps at the end?

Ben Rogojan 42:12

That’s definitely kind of an interesting question. Because like, right, like you have ML engineers that develop models and put them out in systems. And I think they’ve definitely been doing that for a while like larger tech companies. But I think I’ve also seen a lot of people kind of implement machine learning models in things like Airflow, right? Like, if you’re doing like a batch job for your machine learning model, it’s like, Okay, well, what’s one way we could kick this off? Well, let’s just use Airflow. Right? Like that’s, that’s one thing we could do to get out whatever the output is. And obviously, that’s not for live models. But if you’re doing something that’s batch focused, so I think that’s kind of the similarity where you kind of have the same thing where you’re dealing with either batch jobs in kind of ML, or you’re dealing with live more streaming like jobs, and you probably come up with similar optimization problems and like performance problems as well, that you would run into data engineering, when you’re doing transformations and things of that nature. So that’s kind of why I think they’re somewhat related. Whose problem it is, I think, will depend on tooling in the future, I think for now, like, you’ll still have ML engineers kind of taking care of a lot of the implementation or software engineers, in general, just because again, you’re trying to optimize something that maybe you might not have someone that’s both strong in ML and in implementation, so you might have to kind of find this happy medium. But once I think you start getting hopefully better tooling, you can get to a point where maybe ML engineers or machine learning, like researchers and data scientists can figure out a way that they can deploy it, maybe without having to have a whole extra person required for that. Well, I mean, again, that’ll depend on where tooling goes, though.
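
The batch case described here can be sketched as a single function that reads a batch of rows, scores each one, and writes the predictions out. This is a minimal sketch in plain Python: score_row is a hypothetical stand-in for a real model's predict call, and the column names are made up for illustration. The scheduler wiring is left out; in practice an orchestrator such as Airflow would call a function like this on a schedule.

```python
import csv
import io

def score_row(row):
    """Hypothetical stand-in for a trained model: flags high-spend customers."""
    return 1 if float(row["monthly_spend"]) > 100.0 else 0

def run_batch_scoring(input_csv_text):
    """Read a batch of rows, score each one, and return the results as CSV text."""
    reader = csv.DictReader(io.StringIO(input_csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["customer_id", "prediction"])
    for row in reader:
        writer.writerow([row["customer_id"], score_row(row)])
    return out.getvalue()
```

Keeping the scoring logic in an ordinary function like this means it can be tested on its own, independent of whichever tool ends up triggering it.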

Kostas Pardalis 43:38

That’s great. I mean, okay, it’s probably also like a little bit early. It’s still out of the definition, and like trying to figure out exactly how it fits in the organization. But can you give us a list of tools that you think are core or like important for MLOps today that are very commonly used?

Ben Rogojan 43:56

I think there’s a few that I’ve been like, looking into myself, I think, like, I’ve personally been looking into like, things like just Azure ML and the different features that it has, because like, obviously, it’s got some things that are more focused on helping you kind of find the right model. But it also is, I think, going towards that drag and drop very similar to SSIS feel tool. I don’t know what it’s called where you can kind of run models, using things of that nature. I think also, things like feature stores tend to be something that can play a role in MLOps. Also, I’m looking at DataRobot right now in terms of how it’s going to kind of play its role. So I don’t know if there are specific components. I think there’s like specific tools I’m looking at to try to figure out what their role is, what works best where, right. Like, I think that you’re going to deal with a similar problem that we have in the data engineering space, which is there’s just a lot so I’m still trying to figure out which tools will work best.

Kostas Pardalis 44:50

Yeah. Yeah, actually, that’s a very interesting topic–feature stores. The reason I find feature stores very interesting. From a product perspective more, it’s not like the ML or engineering perspective, is that you hear a lot about them. But actually, there aren’t that many of them out there. I mean, you have the big companies that they have built their own. And even companies that traditionally open source many things like Netflix, for example, they haven’t done it yet. And you pretty much have something like Tecton, which is, okay, something that at least until recently, it wasn’t like publicly open, it was like a very enterprise kind of product. And Feast which is open source. And that’s all. I mean, is there anything else out there?

Ben Rogojan 45:40

I think I recall doing something a while back where maybe I saw one or two more, but like, those are definitely the two that I recall. Tecton’s, one of the Slack channels I’m on, so yeah, those are the two that I’m well aware of. But I’m sure there might be some more, maybe more open source style projects out there. But yeah, it really has been kind of, you know, people haven’t really tried to productionize it and make it into a product. Are you thinking about that as your next product, Kostas?

Kostas Pardalis 46:10

I don’t know. I mean, it’s a very interesting data problem that feature stores are trying to solve. We had an episode with someone from Tecton, actually, about feature stores and it was very interesting. It was like the first time that I talked with someone about feature stores and it was very fascinating. But I find it very interesting that we don’t see that many products yet. That’s one. And the other is that we don’t see open source projects, which is another thing, like for example, let’s say, let’s take data lakes, right, we have Delta Lake, which has been open source, we have Iceberg, we have Hudi, you can do your own things, probably with more of like vanilla stuff by just using something like Parquet files and run like Athena or something like that. But I would put that closer to the products that are related to data warehouses and data lakes, right. But you have like, we have quite a few open source projects there. But that’s not the case with feature stores, which I find very interesting. Maybe it has to do with the nature of the problem, or the products or like the scale. That’s another thing. Like what kind of company you believe needs a feature store. And when does it become something important?

Ben Rogojan 47:27

I don’t think I have a good answer for that at this point. I just don’t think I have a good answer.

Kostas Pardalis 47:30

Because I had this conversation with a guy from Tecton. And he was saying that, like feature stores is something that you need that is going to affect the productivity of your team, right? Like you need to have a sizable team to need that. It’s not something that just because you have someone who’s creating a model or two or trying to do some sort of prediction internally, you’re going to need a feature store. Maybe that’s also another reason like maybe it’s a product that you need to have a certain scale and above like to actually need it, or it’s just too early. And it’s this whole product category. A lot of definitions. I don’t know. But it’s very fascinating. It’s very interesting. I’m very interested to observe how feature stores are going to progress as products. I guess like in Facebook, you have similar technologies that you are using, but these are all built internally, right?

Ben Rogojan 48:19

Yeah, yeah. I mean, I think a lot of the stuff that’s like MLOps, it’s funny, like now I take for granted, I take for granted a lot of things. I think at this point, when you talk about my work internally at Facebook, just because it is the positive and negative I think of working at a big tech company, right? Like, you don’t understand all the problems. So before, in order to make my way through college, I worked in the culinary field. And the first restaurant I worked at was like one of the top restaurants in the city. And like eventually I went to a slightly lesser one. And they kind of pointed out they’re like, yeah, like you used to work somewhere where you just basically had to slice a tomato and serve it because you got such good ingredients. And now you work hard to make those ingredients something. So same here. It’s like some places you just start with such a good place, it’s like that’s just so easy. Like, I mean, like when you have a lot of the harder problems solved. It’s not to say there aren’t problems. It’s just different.

Kostas Pardalis 49:18

Yeah, makes total sense. All right, one last question from me. What do you think is the importance of open source in this whole category of data related technologies that we see around us?

Ben Rogojan 49:30

Yeah, I mean, I think definitely open source will always play a role, right? There’s so many, there’s so many things we already kind of rely on in one way or the other that are open source. Like I think even things like Hive, although it started at Facebook, went open source, like it just benefits a lot from people being able to improve the overall solution and not being limited to the 10 engineers that could possibly be working on it. And I think it just gives a lot more perspective to the problems that you run into in that code base, right like you’re not being forced to wait for someone to fix a problem, you can fix the problem. And I think, especially as engineers, that tends to be our mindset anyways, right? Like, just give us the code, right? Like, we’ll fix it, like, just give me the code. I’ll fix it. And then we can go forward with this and make this better together. So I think that’s kind of the important thing in terms of like, why the benefits of open source, right, like we can, in theory, move faster if you have a good community around your product.

Kostas Pardalis 50:26

Yeah. You mentioned Airflow a few times which is an open source project. Are there any other projects that you love that are open source?

Ben Rogojan 50:38

I don’t know if I’d say that I have huge ones I love. Like I keep tabs on things, right? Like Airbyte is something I’ve been keeping tabs on. It’s like an interesting idea in terms of open sourcing data connectors. Yeah, I think that’s the other one that I’ve kind of currently been paying attention to. Yeah, I think that’s currently my focus. I don’t think I have a love for anything. I think if something’s open source, if something costs money. I think it just depends on the tool. Like if I like it, I like it. Snowflake. Like Snowflake is not a cheap thing. But I like it. So I would enjoy using it. But yeah, I don’t think there’s a preference.

Eric Dodds 51:13

Really an interesting conversation. I loved hearing about your experience as a consultant; loved hearing about Facebook. And then of course, the other side of the stack, which is a whole fascinating conversation in and of itself. And something I think we should do an episode on probably here soon, Kostas. Because I agree. It’s the next wave of what’s gonna happen to data once everyone kind of gets the analytics and the data unification sorted out. So, Ben really appreciated the conversation. We’d love to have you back on the show sometime soon.

Ben Rogojan 51:44

Yeah. Thank you guys. I really appreciate your time. I enjoyed this conversation. Yeah, let me know.

Eric Dodds 51:50

That was a really fun show. And I’m going to be a broken record here and restate something that Ben stated, and then that I also restated, but it’s amazing to me, just thinking about the fact that they have more engineers working on internal tooling for engineers than many companies have total employees when they IPO. And that’s just incredible to me, just thinking about having those levels of resources and the types of things that you can build, the speed with which you can build them. And of course, working in a large organization, there’s, you know, process and bureaucracy, but that kind of leverage is pretty mind blowing.

Kostas Pardalis 52:28

Absolutely. I don’t think that we can understand the scale of a tech company like Facebook, and I’m not talking about the technology, like forget about technology, I think the most fascinating thing is the organizational scale. How can you get all these people and like all these thousands of engineers, and create such a consistent product experience at the end internally and externally? It’s amazing. And I don’t think that it’s something that you can easily experience. It was a very interesting and very fun episode. I also want to, outside of what you said, there are two things that I want to keep from our conversation. One is that the problems at the end are the same, regardless of how big or small of a company you are, what changes is the scale, actually. And that might change like the tools that you might be using. That’s like an interesting part of the conversation where we were saying that’s okay, just use Postgres at the end. You don’t really need a huge data warehouse that is super, ultra scalable, like Snowflake, right? That’s one thing. The other thing that I really liked was what Ben said about Snowflake that it’s the Apple of data warehouses. I loved that. I think he’s very to the point about the product experience that people get from Snowflake. So that was also amazing. And yeah, hopefully, we will have him again, for another episode soon.

Eric Dodds 53:50

I really enjoyed hearing that. Because I think especially in the world of data, we have a requirement to be very precise in our work and be very descriptive, requiring very specific features. And there’s this intangible component of really great products that make people say, I just like to use it. And that’s kind of hard to describe. And I love that he brought that up and said it’s expensive, but I just really like it. And I think that’s a big testament to Snowflake and what they built.

Kostas Pardalis 54:22

That’s true. That’s true.

Eric Dodds 54:24

All right. Well, thanks for joining us, and we will catch you on the next one.

Eric Dodds 54:30

We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me Eric Dodds at Eric@datastackshow.com. This show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.
