Episode 30:

The Data Stack Journey with Rachel Bradley-Haas and Alex Dovenmuehle of Big Time Data

March 24, 2021

On this week’s episode of The Data Stack Show, Eric and Kostas are joined by the co-founders of Big Time Data, Rachel Bradley-Haas and Alex Dovenmuehle, formerly of Mattermost and prior to that, Heroku. At Big Time Data, they work together to provide companies with the ability to derive value and insights from decentralized datasets, improve business processes through data enrichment and automation, and build a scalable foundation to enable a data-driven culture.

Notes:

Highlights from this week’s episode include:

Rachel and Alex’s background and their goal to make data approachable for companies everywhere (3:09)
The data stack journey: making decisions when you’re small that allow you to grow with your data and your organization (12:28)
The problems faced when a data stack isn’t nurtured early on (15:59)
Changes in data stack technology (21:32)
How Alex and Rachel’s roles at Big Time Data differ and interact with each other (39:00)
Client use cases (43:34)
Comparing the stacks of seed stage startups, mid-sized companies and giant enterprises (48:54)

The Data Stack Show is a weekly podcast powered by RudderStack. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 00:06

The Data Stack Show is brought to you by RudderStack, the complete customer data pipeline solution. Thanks for joining the show today.

Eric Dodds 00:16

Welcome back to The Data Stack Show. Eric Dodds and Kostas Pardalis here. We have an exciting guest and a surprise guest on today’s show. The exciting guest is Alex, who was at Mattermost, who was actually our very first guest on the show exactly 30 episodes ago. Alex has since started a consultancy called Big Time Data, along with Rachel and both of them are gonna join us on the show today to talk about the data stack journey, how the data stack changes over time. I’m very interested … my burning question for them is … we see this all the time in our work and from people on the show, companies of different sizes have different requirements around the stack that they build for customer data infrastructure. And I want to know which tools stay the same throughout the entire journey from you know, sort of person in a garage startup, all the way up through enterprise level. So that’s what I’m going to ask. Kostas, what’s on your mind?

Kostas Pardalis 01:13

Actually, we’re very aligned on that. I think both Alex and Rachel are the perfect people to chat about how, from this very chaotic market of data-related technologies right now, patterns emerge, and what these patterns are and how people can use them to navigate the whole process of building their own data stack. And there are of course differences between the different types of companies, the different problems that they’re trying to solve and scalability. But there are also many commonalities. So I think we will be able to tackle this today. And of course, I’m even more excited because Alex was the first episode of this show, but also my very first episode for a podcast ever. So I’m really, really happy to chat with him again.

Eric Dodds 01:53

Great. Well, let’s talk with Alex and Rachel.

Eric Dodds 01:57

Alright, we have our very first podcast guest ever, from the first episode back on the show, and have added another special guest. So Alex, who was at Mattermost, when we talked with him last and Rachel, who was also at Mattermost at the same time, have joined us to talk about the data stack journey. Thank you so much for joining us.

Alex Dovenmuehle 02:20

Oh, you’re welcome. Glad to be here again.

Rachel Bradley-Haas 02:24

Glad to be here for the first time.

Eric Dodds 02:26

Well, we had a great episode with you, Alex, talking about all sorts of interesting things. I guess it was, wow, six or eight months ago now.

Rachel Bradley-Haas 02:35

Time flies when you’re stuck in a house, right?

Eric Dodds 02:37

Yeah. Yes, indeed. Yes, indeed. Well, why don’t we start out, we’d love to just get a little bit of background. So you had different roles at Mattermost but worked very closely together. But we’d just love a little personal background on your history, how you ended up at Mattermost. And then what you’re doing today, which is Big Time Data, which we want to hear about as well. So, Rachel, why don’t you start, and Alex, of course, we want our new listeners to hear your story. But we’ll let Rachel start.

Rachel Bradley-Haas 03:09

Yeah, yeah. So my background is in industrial engineering at University of Michigan. Rep that till the day I die. So go blue. And really what ended up happening, graduated, went to Cisco, really, honestly, very lazy in the way that I never liked to do the same thing twice. So got really into automation, data, how do you scale all that, and then wanting to go deeper on a technical level. So I ended up moving over to Heroku, which at the time was a subsidiary of Salesforce, and did a lot of data analytics, data engineering. I ended up spending a little bit more time on the operations side. So my role grew into really understanding how all the data in the data stack can be used to drive go-to-market motion, automation, and scalability. And then, after that, I kind of felt I had outgrown my role there and decided to take a risk and go to a smaller company with Alex. So we ended up going over to Mattermost. And really starting from there, just understanding how do we start a data infrastructure from scratch, basically, using some open source technology, some new tools we had never used before. And then also, how do you help an organization adopt a data-driven culture and really embed that in their day to day. So that’s where we’re at at this point. And then, you know, once Alex is done, we’ll talk a little bit about how Big Time Data came about.

Alex Dovenmuehle 04:30

Just as far as my origin story in this whole thing, I come from, you know, a computer science, full stack developer kind of background. It was really at Heroku that I got into all the data engineering things and basically modernized their data stack. At the time, when we were there early, this was six years ago, five years ago, you know, they were using batch scripts to run stuff. And it was just a total nightmare. They were using Postgres as their data warehouse. So we migrated all that to DBT, Airflow, Redshift. And that’s where Rachel and I first met, at Heroku. And, you know, she was doing the analytics and operations stuff. And it ended up being, at one point, basically just the two of us tackling all this stuff by ourselves. And we ended up, you know, building teams there and everything like that. So then, when we moved to Mattermost, like she said, they really had no infrastructure at all. So we built that all up from scratch. And then, as far as the evolution of Mattermost, we built all this stuff at Mattermost and we could see and show the value of all the, you know, architecture and technologies we were using and how we were using them. And then what we started to notice was, you know, there’s all these other companies that have the same problems, right? They all have a bunch of data, they don’t know what to do with it, or how to get the value out of it. And so that was really what started to spark the idea of Big Time Data and us kind of going out on our own and, you know, actually spinning up a consulting company, because what we really want to get to is building this for a bunch of different companies. I want every company to succeed, right? I just want all of them to be able to harness the power of their data, and make their company the best it can be.

Rachel Bradley-Haas 06:27

Yeah, just to add to that, I feel one of the things that’s really hard is people talk a lot about data, and how to build these state of the art data stacks. And, you know, for us, it feels very approachable, right? Because we live in that every single day. And just, thank goodness, our parents were smart, and therefore we became smart as well. So it seems very intuitive. But when we went to Mattermost, I remember thinking, oh, gosh, they’re gonna know we’re frauds. We’re really not as great as they keep saying we are, sort of thing. We got in there and just the smallest things that we would do, where we would say, Oh, yeah, you’re just gonna do this and throw this on top of it. And it’s straightforward. And they were just thinking, oh, my gosh, you’re a godsend. Like, this is amazing. I never would have done this. And you’re kind of thinking, huh, that’s weird. That’s just something I thought everyone knew. And so as we continued along and talked a lot about how do you scale your operations using scripting and, you know, how do you really support self service analytics and data governance, we started realizing these are things that are not talked about enough, or there’s almost a sense of, I’m too embarrassed to ask because it seems everyone knows what they’re doing.

Rachel Bradley-Haas 07:34

And so from our perspective, it’s, we want everyone to be able to do that. We want to put documentation out there, we want to have best practices, we want to make sure that people can do these things, because data is so important. And so that’s really where my passion has come from from this. It’s so many easy, small conversations that help people build confidence to make those risks. And so, you know, that’s one of the reasons why we’re on this podcast right now is just making sure people know, all you have to do is take one step at a time towards your future goal, and it really is approachable if you have the right people.

Eric Dodds 08:07

Sure, you know, it’s really interesting, Rachel, that you mentioned people being afraid to ask questions, because they think that everyone has it figured out. I was in consulting before joining RudderStack, doing similar things, but more on sort of the martech side. And it was so interesting. There’s almost an imposter syndrome type dynamic in many companies where you just have this sense that we’re the only company whose Salesforce is really messed up, and who’s having trouble cleaning our data and getting insights. And the more companies that you talk to, the more you realize, literally every company has the same problems, right? It’s pervasive, and it’s not because people aren’t working hard or they aren’t smart. But technology is changing quickly. And when you have a quick growing company, it’s just really hard to align both the organization and the tools and the data and everything to make it work out correctly, especially if you don’t have a playbook. So that really resonates with me, because I saw that all the time. Like we just, you know, it’s almost like I’m embarrassed about the state of our situation.

Rachel Bradley-Haas 09:20

Yeah. And I have two things to add to that. I think it’s one of those things where you end up finding, this is more of a psychology thing, I feel that people end up talking more about the parts that they’re comfortable with, right? And so you all of a sudden have companies that are doing one thing right, but everything else is kind of crap. And they’re talking about that one really great part, but you’re comparing it across your entire system. And so all of a sudden, you have this perception that everyone else has everything great, when in reality, it’s just that one part that they’re talking about. And man, Alex knows, he can get on a call and go so in depth with all these different tools that I’m sitting there just nodding, pretending, you know, letting my imposter syndrome get to me. But then I realize one of the reasons why Alex and I are such a great partnership is because we don’t need to know everything ourselves. You know, we obviously have great friends over at RudderStack, we have great friends at DBT, across the board everywhere we’ve been. But that’s why it’s so great to have a community. I know a lot about go-to-market motions and using data to drive that. That’s something that Alex isn’t as strong about. So it’s just one of those things, you know, don’t be too hard on yourself if you’re not there yet and be realistic about what’s really going on. And man, about the Salesforce thing, 100%. We worked at Heroku, which was part of Salesforce, and we were struggling to do it right. So I definitely, definitely get that one.

Eric Dodds 10:39

Yeah, I mean, this is a little bit tongue in cheek, but, I mean, it really is the reality. We used to joke, we used to ask people, have you ever seen a Salesforce that wasn’t a mess?

Rachel Bradley-Haas 10:51

Yeah, when you spin them up. *Laughter*

Eric Dodds 10:56

That’s so good. That is so good. Okay, well, we have so much to talk about. And I know Kostas has many questions. So I’ll kick off with the first question on our topic of the data stack journey. So we wanted to have both of you on the show because, one, you bring an interesting perspective of working together at multiple organizations on the data stack, sort of from two different directions, you know, sort of the data engineering perspective and heavy technical side from Alex’s side, and then the ops, sort of go-to-market alignment on your side, Rachel. And going from Heroku, which is a huge, I mean part of Salesforce, right, massive company, to Mattermost and now having consulted with a variety of organizations, you present a really interesting perspective on the best practices for building a data stack that will scale and how that needs to change over the life of an organization, right? Because when you’re just starting out, and you maybe have a two-person company, your needs around the data stack are very, very different than, you know, when you get to the size of a Heroku that’s running inside of, you know, massive enterprise Salesforce. So I’d love to get the perspective from both of you on what is the data stack journey. Just give us an overview of, you know, from the perspective of a company just starting out to becoming a large enterprise. What does the data stack journey look like? How would you define it?

Alex Dovenmuehle 12:28

Yeah, so the data stack journey to me is, how do you build your data infrastructure in a way that can grow with your company as it’s growing, and still give you all the value that you need, while being efficient with costs and operational burden and that kind of thing. Like you said, if you’re a two-person company, you know, having a bunch of different tools, and you know, a bunch of different infrastructure that you’re having to maintain, it’s just gonna waste your time when you should be, you know, talking to customers or whatever. But on the other hand, once you get to that Heroku size, you know, you can really dig in to optimizations that are only valuable because you’re doing them over and over and over again. And so, you know, this idea of the data stack journey is, how do you make those decisions up front, when you’re small, that allow you to grow with your data and your organization, and not shoot yourself in the foot where you’re having to, you know, spend a bunch of time doing rework, or, you know, your analysts are just fighting data fires, and they can’t figure out why the data’s wrong, and all that kind of stuff.

Rachel Bradley-Haas 13:47

Yeah. Just to add to that, I think there’s a couple different variables that come in when talking about that. You know, you’re talking about, are you willing to pay more to have more scalability because of limited bandwidth, right? And so you’re saying, if I have two people, and I have one tool that does it all, and maybe it’s $1,000 more a month, versus, you know, five different tools, you start to think, Okay, how much time is it gonna take to move between them if there’s an error or something needs to be debugged? How much longer is it going to take because you now have to look at five different tools? The other thing that you brought up, Alex, which I think is so important, is, you know, if you think about where you are now, and where you’re going to be in a year or five years, and so on, you have to think about the cost it would take to move from one to the other, right? So right now, you might say, Oh, it’s $1,000 more a month. It’s not worth it. But one year from now, if re-engineering is going to take an entire engineer’s month, is that more than $12,000? So in my mind, from the operations perspective, I start to think about the dollar amount and the cost of an engineer’s time and, honestly, the morale, right? You want to keep these people around. We all know the worst thing in the industry is losing someone when a company’s so small and they have all the knowledge. That’s, you know, that’s a huge deal breaker. You want to be using the tools that engineers or analysts want to be using, so you can keep them around and retain them, because the loss of an engineer or an analyst is unimaginable. And I would say, you know, close to $200,000 or $500,000, depending on where you’re at.
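
The trade-off Rachel describes (a $1,000-a-month premium now versus a one-time re-engineering cost later) can be written out as simple arithmetic. The sketch below is only an illustration: the $1,000 figure comes from the conversation, while the cost of an engineer-month is an assumed placeholder, not a number from the episode.

```python
def months_to_break_even(extra_monthly_cost: float, migration_cost: float) -> float:
    """Months of paying the premium before it adds up to a one-time migration cost."""
    return migration_cost / extra_monthly_cost


# From the conversation: the consolidated tool costs $1,000 more per month.
extra_monthly_cost = 1_000
premium_per_year = extra_monthly_cost * 12

# Assumption (hypothetical, not from the episode): one fully loaded
# engineer-month of re-engineering work costs $15,000.
assumed_engineer_month = 15_000

print(f"Premium paid over one year: ${premium_per_year:,}")
print(
    "Months before the premium exceeds the assumed migration cost: "
    f"{months_to_break_even(extra_monthly_cost, assumed_engineer_month):.1f}"
)
```

Under that assumed figure the premium is the cheaper option for more than a year, and the comparison ignores the harder-to-price factors Rachel raises, such as debugging time across several tools and team morale.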

Eric Dodds 15:22

Absolutely. What do you think … and I’ll ask one more question here and then let Kostas jump in. And this may sound a kind of an obvious question, because we, we just see this so often, right, with a growing company, and you have good intentions, and then, you know, you just don’t seem to have the time or the resources to do things the right way from the beginning. Why do you think that happens? What are the main, you know, maybe top two or three things that produce the downstream problems that companies face, if they aren’t really careful about nurturing their data stack early on?

Alex Dovenmuehle 15:59

I mean, I think the first thing is going to be that, as an organization, you’re not going to have that muscle of, hey, when we implement this new feature in the product, we need to, you know, track its usage in a decent way, right? So that we can have the insights, like, are they doing this thing, right? And, you know, so then what’s going to end up happening is, you’re going to kind of end up being Mattermost, where they had sort of a data warehouse, and they were sort of running some queries on it. But the data quality was kind of low. Things were all one-offs, and it just wasn’t scalable at all. And so then you have to really go through that whole migration process. It’s not only a technology change, it becomes a people change and an organizational change, and as a growing company, you’re already dealing with so many challenges from that growth, just in general, that having to deal with data growth, and, you know, all that stuff, just adds to it, right? So it’s better to, if you can, it doesn’t even have to be crazy, you know, amounts of time that you’re spending on all this stuff, right? Like, you can just do a few things. And I think, you know, the more I’ve been thinking about it, it’s, can we as Big Time Data provide some tools … and this is kind of going back to the imposter syndrome thing, can we provide some guidance and tools and guides or something that can give people the confidence that, Hey, I’m not totally screwing this up, even though I don’t know everything about it. Like, I’m not an expert, but I know I need to do something.

Rachel Bradley-Haas 17:44

And the other thing that I’d add on top of that, obviously, you know, Alex, you and I always think about these questions from different perspectives. But I think that’s great. From my perspective, I think the biggest impact is, if you have all these brilliant people that need to be focusing on strategy and making sure that the business is successful, you know, you’re going to hit that pivotal moment, and are you going to be ready to blast off? Or are you just going to be a dud? And if you have the leaders of your organization spending their time questioning numbers instead of focusing on strategy, that’s a big deal. You start to have a VP of Marketing presenting numbers about, you know, your pipeline, and there’s a disagreement, and all of a sudden you’re spending the full day trying to get ready for a board meeting and questioning how many MQLs you have, instead of saying, how are we going to present this? What are our next steps? What are we doing for the next year? How is our product going to change as our customer evolves? Right? Those are more important questions than, what’s the definition of an MQL? Is our data from Salesforce coming in accurately? Do we have the right, you know, triggers in our product to promote, you know, growth, all these different things? Right. So I think it’s so important that you have data in the right place, with the right definitions, and it’s trusted, or else you end up spending these unaccounted-for hours trying to figure those things out, and no one tracks that anywhere. It’s just something that comes as part of the job. And I think as soon as you realize that you wouldn’t have to have as many of those conversations if you had invested a little bit upstream, you’re gonna regret not having done it already.

Eric Dodds 19:15

Absolutely. The board meeting scramble. That is probably a good topic. That’d be great to collect war stories, because you said that and I think myself and probably a lot of our listeners know exactly what you’re talking about.

Rachel Bradley-Haas 19:30

I have one thing to add to that. I’ll just say, you know, this goes into the whole, I’ll give a little shout out to Michael Schiff. He’s been a mentor and my boss and Alex’s boss for a while, you know, “you pay now or you pay later”, is one thing he always said. The other thing is you’re training them or they’re training you, and I will tell you that at Mattermost, we’ve worked very closely with Aneal, who’s our VP of Finance there. And we have trained him how to go and get his own numbers, how to trust the data for all of the things that he needs to present to the board. And I’ll say, for the last board meeting, there was only one question that he reached out to me about, and the ability for him to self serve, when initially it was, you know, four to six hour calls trying to get him numbers, has just been amazing. It just kind of shows, as you train them and as your data stack evolves, people are able to trust the data and feel more confident going and getting it themselves.

Eric Dodds 20:26

Love it. That is really cool. All right, Kostas. I’ve been monopolizing. And I can keep going, but I’m not going to because I know you have a ton of questions.

Kostas Pardalis 20:36

Yeah, Eric, I think this is a common pattern lately on our shows but it’s fine. I mean, you’re asking very good questions anyway. So it’s good. For me, it’s a very special episode today. Because you know, Alex was my first ever guest on a podcast episode. So I’m super happy and excited to have him back and also to have Rachel together with him, because they are both working on the data stack, but they see it from a different perspective. So I think it’s a great opportunity to have both perspectives at the same time. So let me start with a question about the data stack. I mean, you’ve been working with data for quite a while. And you have seen the changes that have happened in the technology. So how has the data stack matured since your time at Heroku? Or even earlier, if you have experience from before that. And what are the tools that really excite you that exist today that didn’t exist in the past?

Alex Dovenmuehle 21:32

Yeah. Yeah, it’s crazy how much things have changed, and it feels like it hasn’t been that long. And yeah, I mean, you know, going back to Heroku, the early days, I mean, I already mentioned the batch scripts and stuff. But, you know, the SQL that we were writing, I mean, literally, and I’m not kidding, 1,000-line SQL files were not uncommon.

Rachel Bradley-Haas 21:58

I don’t know why you’re complaining, I really enjoyed debugging those scripts.

Alex Dovenmuehle 22:02

Yeah, we had just amazing data quality, too.

Alex Dovenmuehle 22:05

And so one of those tools that I just preach the gospel of everywhere I go is DBT, which we started using at Heroku three years ago, or four, anyway, something like that. And it was funny, because it was actually a data engineer on my team who just sent me this link. He was like, Hey, I saw this thing on Hacker News. And then we started looking at it, like, Oh my gosh, we have to use this, what are we doing? And so then you go from, here’s this 1,000-line SQL file that I can’t make heads or tails of. I mean, eventually, I could, but you know, every time you have to debug the thing, it takes you four hours to remember all the nooks and crannies of the stupid thing. You go from that to, Oh, I just have, you know, a 50-line DBT model, and then a couple other ones, and everything just works. It’s amazing. So that’s one. I think the other thing that has been really interesting is just the availability of tools that make dealing with large amounts of data easy. You don’t have to be a PhD person to be able to deal with big data anymore. And I think there’s just been so much done there that it really helps, I mean, anybody, right? Like anybody can deal with terabytes of data now. Whereas before it was, Oh, my gosh, I have terabytes of data. Like, it’s gonna take me hours and hours to query any of this stuff. And I don’t know what to do. So I’ll let Rachel add to that in her way.

Rachel Bradley-Haas 23:49

Yeah, I mean, one thing that you didn’t call out is the biggest thing that we ended up changing right away, when you took over our data stack at Heroku was adding Airflow. We used to have everything be shipped basically in one massive daily or hourly job. And it would be 50 in an hourly job. And you know, 120 in the daily job, they would lap themselves. It was utter chaos. One thing fails, you have to kick it off by itself and have to track it to make sure it finished. It was terrible. And so just getting Airflow and all of that going was a huge game changer for us.
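
The problem Rachel describes, one monolithic job where a single failure forces manual re-runs, is what dependency-aware orchestration addresses. The sketch below is not Airflow code; it uses only Python’s standard-library `graphlib` to illustrate the idea, and every task name in it is hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "load_raw_events": set(),
    "load_crm_accounts": set(),
    "build_usage_model": {"load_raw_events"},
    "build_account_model": {"load_crm_accounts", "build_usage_model"},
    "publish_dashboard": {"build_account_model"},
}


def run(pipeline, runners, already_done=frozenset()):
    """Run tasks in dependency order and skip anything downstream of a failure.

    Returns (succeeded, failed, skipped) as sets of task names.
    """
    succeeded, failed, skipped = set(already_done), set(), set()
    for task in TopologicalSorter(pipeline).static_order():
        if task in succeeded:
            continue  # finished in an earlier run, no need to repeat it
        if pipeline[task] - succeeded:
            skipped.add(task)  # at least one upstream task did not succeed
            continue
        try:
            runners[task]()
            succeeded.add(task)
        except Exception:
            failed.add(task)
    return succeeded, failed, skipped


calls = []


def ok(name):
    return lambda: calls.append(name)


def boom():
    raise RuntimeError("simulated failure")


# First run: one model fails, so everything downstream of it is skipped.
runners = {name: ok(name) for name in PIPELINE}
runners["build_usage_model"] = boom
done, failed, skipped = run(PIPELINE, runners)
print("failed:", sorted(failed), "skipped:", sorted(skipped))

# After a fix, rerun only what is left; the loads are not repeated.
runners["build_usage_model"] = ok("build_usage_model")
done_after_retry, _, _ = run(PIPELINE, runners, already_done=done)
print("completed after retry:", sorted(done_after_retry))
```

A real orchestrator adds scheduling, retries, and monitoring on top of this ordering, but the core benefit is the same: a failure stops only its own downstream tasks, and a rerun picks up from the failed task instead of restarting the whole batch.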

Rachel Bradley-Haas 24:23

And then, Alex mentioned, we started talking about DBT. And from my perspective, I don’t think I conceptually understood what it was doing at first. I viewed it as, cool, it’s a different way to organize your code, yada yada yada, huge investment. This just feels like a tool that an engineer wants to use because they’re bored of their day to day job and they want to have a new tool to mess around with. I was so wrong. I think I was very busy at the time and I didn’t really take the time to really understand what it was, because at that time I had a team and was very heads down, more on the analytics side. When we moved over to Mattermost and started from scratch, with basically no code, seeing how great it was to build these dependencies on top of each other and have it be so clean, where if you just need to change one small piece of logic at the granular level, it, you know, scales and moves throughout your entire, basically, data model. And so being able to see that, I’m so glad that they still invested in it at Heroku, even though I wasn’t a huge proponent of it. It was a good idea. And I think it’s one of those things where, once again, you pay now or you pay later, you’re going to have technical debt. And I think DBT really helps you manage that, in a way. It really limits how complex your technical debt will get; so big fan of DBT as well. And then you know, I’m just a huge Looker fan girl, I can’t help it. That’s always been something I’ve been very lucky with since we went to Heroku. Heroku had it from day one. When I did my interview, I did stuff in Looker. I don’t think I ever want to live in a world that doesn’t have Looker available. I just think how they’ve turned a visualization tool more into a data governance tool as well, one that allows self service scalability, has been a game changer in terms of making sure analysts can focus on the important things and not become report monkeys, right? That’s everyone’s biggest fear being an analyst: do they think I’m a report monkey? Or do I really get to drive change in the business? And so I think Looker has enabled analysts to focus more on driving change and diving into the data, because you now do have people like the VP of Finance feeling comfortable going and looking through Looker, and pulling data themselves with confidence.

Kostas Pardalis 26:44

Yeah, that’s a great point, Rachel. I think, and I’ve said in the past, that the most successful tools, at the end, don’t just add value or simplify processes, they actually promote organizational change. And that’s a very good point about Looker. And I’m happy to hear that from you. So from what I understand, I mean, some major changes that happened in the space are things like orchestration, and you talked about modeling and composability in SQL, something that was missing from the language for a long time. So my feeling is that there are many standard, let’s say, DevOps or software engineering techniques that software engineers have been using for quite a long time that are entering this space. And that’s a sign of maturity. It’s a way of saying, let’s adopt methodologies, techniques and technologies that have proved to add a lot to productivity and the way we work. What else do you think is going to be introduced? I mean, there are things like CI/CD, there are things like testing, especially, I mean, DBT is doing a lot of things around that. But I think it’s still on the immature side of the data stack. So what do you think is going to be the next big thing that’s going to be introduced to the data stack, that’s going to have a lot of impact on the everyday work of someone who is managing and building these data stacks?

Alex Dovenmuehle 28:08

From my side, I think there’s two things, and one you touched on, which is testing, which really is more about data quality, right. And, you know, you see things with great expectations, you know, DBT has some testing stuff built in, and they even just came out with a great expectations package that you can use. And I really think that’s going to be, you know, you said, bringing actual, software engineering techniques, CICD you want to have, unit tests, that kind of thing. It’s been really hard to do that in your data warehouse. And so then you end up in that situation where, you know, your VP of finance comes to you and is, this number doesn’t make sense that I’m seeing. What’s the deal here? And then you’re having to trawl through all your data, trying to figure out what the issue is, right. So I think that’s going to be one thing that’s going to really take off and it should take off. Like that should be part of your, you know, that should just be the way that data warehouse and data engineering is done. It’s, okay, I’ve developed my model, but now I need to test it and make sure that it’s the way it is. And then the second thing that I think is interesting, and I want to learn more about it and get more into it is getting to more real time analytics, not only analytics, but also doing stuff with all of the data that you have in your data warehouse that triggers in real time, something to happen, whether it be marketing or something in the product. Things that, I think could really be interesting,, you know, you look at Materialize, where they can basically ingest all this data from a Kafka stream and you can write a SQL statement on top of it that you know, updates basically in real time. It’s, what if instead of your DBT models, you know, you have to run them incrementally every hour, or whatever it happens to be. What if they just always were up to date? What if that just automatically happened? I think that’d be really cool. 
And that’s something that I’m keeping an eye on and trying to learn more about and see what value that we can get out of technologies, because I think that’s where people are going to start really looking for stuff.
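
The "develop the model, then test it" habit Alex describes can be sketched in a few lines of Python. This is only an illustration of the idea, not the API of DBT or Great Expectations: the function names, the `model_rows` data, and the column names are all hypothetical.

```python
from collections import Counter


def rows_with_null(rows, column):
    """Return rows where `column` is missing or None (a 'not null' test)."""
    return [row for row in rows if row.get(column) is None]


def duplicated_values(rows, column):
    """Return values of `column` that occur more than once (a 'unique' test)."""
    counts = Counter(row[column] for row in rows if row.get(column) is not None)
    return sorted(value for value, n in counts.items() if n > 1)


def rows_below(rows, column, minimum):
    """Return rows whose non-null `column` value is below `minimum`."""
    return [
        row for row in rows
        if row.get(column) is not None and row[column] < minimum
    ]


def run_checks(rows):
    """Run every check and return only the ones that found problems."""
    results = {
        "revenue_not_null": rows_with_null(rows, "revenue"),
        "row_id_unique": duplicated_values(rows, "row_id"),
        "revenue_non_negative": rows_below(rows, "revenue", 0),
    }
    return {name: problems for name, problems in results.items() if problems}


# Hypothetical output of a revenue model, with three deliberate defects.
model_rows = [
    {"row_id": 1, "account_id": "a1", "revenue": 1200.0},
    {"row_id": 2, "account_id": "a2", "revenue": None},
    {"row_id": 2, "account_id": "a3", "revenue": -50.0},
]

failures = run_checks(model_rows)
for name, problems in failures.items():
    print(f"FAILED {name}: {len(problems)} problem(s)")
```

Running checks like these after every model build is what catches the bad number before the VP of finance does.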

Rachel Bradley-Haas 30:31

Yeah, a few things come to mind for me. I think one of the things when I think about data, and it’s been great that there’s a huge growth of companies that are really focused on the data engineer, right, I feel like for a while it was kind of, well, they just do what we tell them to do and make the data happen, we’re not really going to invest in tools for them, and whatnot. And now I think there’s huge importance on it, which has been great. So that’s why you see some of these companies coming out of nowhere with a bunch of stuff to support them, and really making sure they have what they need. But with more tools comes more issues around integrations and timing, and you start to think about, okay, well, I’m piping in my data with a Stitch or a Fivetran, then I have to run my DBT jobs, and then I have to send that data somewhere else. And if you don’t have really great scheduling, or orchestration, you’re all of a sudden sending stale data out because your DBT job took too long to run. It’s not timed up perfectly. And so you start dealing with, how do you make sure that everything’s kind of talking to each other so that it is going based on dependencies and all that. And then the other thing is tool consolidation, because I do start to worry about how much of the data stack is going to be very piecemeal. And if something goes sideways, you know, debugging that many tools can be very difficult. And so are you going to start seeing companies have more integrations with each other? And you know, talk to each other? You think about the Salesforce idea where you have these different packages and installations and whatnot. Are you going to see, you know, connections between a Stitch and a DBT, or a RudderStack and a DBT? And then are you going to see DBT have connections to another tool that then is going to write to Salesforce, all these different things? It feels like there aren’t as many really strong integrations there yet.
So while it might not be a tool itself or a product, it’s how do you make sure that all these dispersed tools are talking to each other and have really great alignment? Because if there’s any gaps in that system, the data engineer and analytics and honestly, business as a whole will suffer.
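
Rachel’s point about running things "based on dependencies" rather than on a clock can be sketched with Python’s standard-library `graphlib` (Python 3.9+). The step names and the `run_step` callback are hypothetical stand-ins for real extract, DBT, and reverse ETL jobs; an actual orchestrator would do far more.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
pipeline = {
    "extract_salesforce": set(),
    "extract_product_events": set(),
    "dbt_run": {"extract_salesforce", "extract_product_events"},
    "dbt_test": {"dbt_run"},
    "sync_to_salesforce": {"dbt_test"},
}


def run_pipeline(steps, run_step):
    """Run steps in dependency order and skip anything downstream of a failure.

    `run_step(name)` returns True on success. A skipped step is recorded as
    failed so that its own dependents are skipped too, which is what keeps
    stale data from being synced out.
    """
    completed, failed = [], set()
    for step in TopologicalSorter(steps).static_order():
        if steps[step] & failed:
            failed.add(step)  # an upstream step did not succeed
            continue
        if run_step(step):
            completed.append(step)
        else:
            failed.add(step)
    return completed, sorted(failed)


# Simulate the DBT job failing: nothing after it should run.
completed, failed = run_pipeline(pipeline, lambda step: step != "dbt_run")
print("completed:", completed)
print("failed or skipped:", failed)
```

A time-based schedule would have fired the sync regardless; here the sync only runs once the models it reads from have been built and tested.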

Kostas Pardalis 32:36

These are some great points. And actually, it’s something that I’m also thinking about lately. I mean, I totally agree with you, Rachel, about the way that it works right now, with all the different tools, and just adding more and more tools. Like for example, I think it’s a very common pattern to see companies using both Stitch Data, for example, and also Fivetran, just because there are different needs for integration, or they are trying to control their costs and all these things. It’s good to have many options out there. But the downside of this is that you end up having a stack that’s more fragile, right, where it’s much more difficult to figure out where the problem is, and especially when we are talking about tools that are cloud based, right, it makes the whole process of trying to debug much more time consuming and much harder, in my opinion.

Kostas Pardalis 33:28

But what I find more interesting, and I would like to hear both of your opinions, is that we are talking about data stacks where the core of the data stack is the data warehouse, right? It acts as the central repository of all the data that we have. And this is, of course, a great architecture, and it works really well. And that’s why companies are adopting this. But as we add more and more tools that have to interact with it, and especially when we are talking about real time, right, the utilization of the data warehouse is probably going up, right. And one of the selling points of data warehouses like Snowflake or BigQuery is that you can control your costs, because you pay as you go, right? Like you have to execute the query, and then you’re going to pay. But we are reaching a point where I don’t see or I don’t feel that the data warehouse is going to be sleeping a lot. So in the end, we might end up in a situation where the data warehouse is just working 24/7, and optimizing the cost around that, from my experience at least, is not the easiest thing to do.

Kostas Pardalis 34:36

So two questions here. I mean, first of all, I’d like to hear your opinion on that, if you agree with this. But how do you think the data stack is going to evolve to address these things, especially with the position of the data warehouse? And how do you think that data warehouses like Snowflake or BigQuery, or even Redshift, can adapt to these new data challenges? Okay, traditionally a data warehouse is not something that should provide responses in real time, right? It’s not something that should be working 24/7, naturally. And that’s how the systems were designed. But the industry has different requirements right now. So what do you think about this?

Alex Dovenmuehle 35:16

Yeah, so I think at Mattermost basically, we do have an extra small virtual warehouse running basically 24/7, like you’re saying, and that’s just kind of been, well, we just kind of have to have that going, you know, all the time. And that’s just the way it is, I think the … and, you know, we have spent a lot of time actually optimizing our Snowflake costs at Mattermost. And, you know, it’s anything from just warehouse optimization, as far as what jobs are you running against which warehouse, and you know, how often do you run them and all that kind of stuff, to even optimizing queries, right? Because if you can take a query runtime from, you know, 10 minutes on an extra large warehouse to five minutes on an extra small, you know, at least in Snowflake land, that’s gonna be quite a cost savings. So, you know, I think going back to kind of the Materialize idea, I think that is why I get a little bit excited about that, too, because it’s, can you use Materialize more as your real time data store that gives you that real time access? And then you know, behind the scenes, you just are doing your regular things with Snowflake, and all that.
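
Alex’s example can be worked through as rough arithmetic. The sketch below uses Snowflake’s standard warehouse credit rates, which double with each size (X-Small is 1 credit per hour, X-Large is 16). It assumes the warehouse is billed only for the query’s runtime and ignores the minimum billing on resume, idle time before auto-suspend, and the dollar price per credit, all of which vary by account; the function name is hypothetical.

```python
# Credits per hour for standard Snowflake warehouses; each size doubles the rate.
CREDITS_PER_HOUR = {
    "XSMALL": 1,
    "SMALL": 2,
    "MEDIUM": 4,
    "LARGE": 8,
    "XLARGE": 16,
}


def query_credits(warehouse_size, runtime_seconds):
    """Approximate credits consumed while a query keeps a warehouse running.

    Simplification: bills only the runtime itself, with no minimum charge
    and no idle time before the warehouse suspends.
    """
    return CREDITS_PER_HOUR[warehouse_size] * runtime_seconds / 3600


before = query_credits("XLARGE", 10 * 60)  # 10 minutes on an extra large
after = query_credits("XSMALL", 5 * 60)    # 5 minutes on an extra small

print(f"before: {before:.3f} credits per run")
print(f"after:  {after:.3f} credits per run")
print(f"the optimized query uses about {before / after:.0f}x fewer credits")
```

Under these assumptions the optimized query costs roughly one thirty-second of the original per run, which is why query tuning and warehouse sizing tend to be looked at together.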

Rachel Bradley-Haas 36:34

Real quick on that, Alex. Like, just to go back to what you’re saying. Because you know, I think Materialize is a great option, you know, as that space continues to evolve, but for right now, right, you start to think about–and tell me if I’m completely wrong, because once again, this is why I feel very lucky to have you–but we basically have that extra small warehouse running all the time, which is then dumping data, obviously, modeling this data, bringing it in, and then we have Looker going against different, more powerful warehouses, that in the moment, if someone is querying something using Looker, or whatnot, you know that we’re paying more, but that one’s not up all the time, because you’re taking care of all of your piping data in and modeling in a different warehouse, which is up a lot of the time, but they’re smaller, and then we have a bigger warehouse that maybe is running more complex stuff. But that’s only running as a user needs to access it, right.

Alex Dovenmuehle 37:29

Yeah, yeah, exactly. I mean, that’s where you sort of have to, I think, people who are new to Snowflake really need to understand how that pricing model works, because you can kind of rack up a lot of cost if you aren’t a little bit careful with it. The other thing I think you can look into is, you know, Snowflake has their Snowpipe, which is a lot less money as far as getting the data into the data warehouse. And then I know BigQuery has ways for you to stream data into the warehouse as well, you know, for less cost. So, you know, I think, at the present moment, like Rachel’s saying, that’s sort of where we’re at with all this stuff. And you just sort of have to play the game in that way. And, you know, as far as moving forward, I think we’ll see more stuff like your Snowpipes and things like that, where it’s, okay, you can optimize your costs for sort of a subset of the use cases that you need it for.

Kostas Pardalis 38:30

That’s great. I have two more questions, and then I’ll give the microphone back to Eric. I know, he also has a lot more questions to ask. So you are working on the data stack, but on a different part of it. Right. And obviously, you have been very successfully working on that, all these years. Can you describe a little bit more how your roles differ? And how do you interact with each other?

Alex Dovenmuehle 39:00

Yeah, yeah, for sure. So yeah, I mean, basically, the way I’ve kind of been seeing it is, I do whatever Rachel needs me to do to do the go-to-market things and operations and analytics things to make it work. I’m kind of the plumber. I also considered myself as building the Legos, and then she would put them together, is kind of the way I would think about it. And that’s why, I mean, honestly, that’s kind of why we, you know, started this whole thing, because we just work well together. And really, there’s no gaps, right? Like if we were both just hardcore data engineer people, it’s, okay, yeah, there’s some cool stuff we could do. But together we’re able to really have a huge impact on organizations. And I mean, you can see just based on this conversation, we definitely think about things in a different way, but it ends up fully forming the idea and, you know, the solution.

Rachel Bradley-Haas 40:04

I think the thing that’s been really great, and one of the reasons why I even feel comfortable going into business with Alex, is we just have such a great level of trust with each other. I think what ends up happening is I take a lot of time to understand the needs of the business and really think through where we need to go. What are the things that the business doesn’t even know they need from the data yet? And how do we make that happen? And so what ends up happening is I go to Alex, and I say, Well, what about this? What about that? Brainstorming these moonshot ideas, and Alex is absolutely brilliant in my mind, I mean, don’t tell him because, you know, I don’t want his ego to get too big, but I think what’s really cool is we take these different tools and anything that doesn’t exist, you know, he’s able to bring a custom aspect to it. So from my perspective, I do everything from basically what I would consider analytics engineering, all the way through process flows in Salesforce, and helping marketing define how they want to do their pipeline and sales forecasting and all these different things, right. So we really do meet at that overlap, right where data engineering hands it off to analytics engineering. And I’ve been, literally in the past two weeks, I think I have really honed in on being obsessed with the concept of the analytics engineer, because I think in the past, either I was oblivious to it, or it really is that new, I don’t think that that concept was really there. I used to call it a hybrid analyst engineer. And I think those people that have the ability to map business logic to raw data and model it in things like DBT are where there’s going to be a lot of investment, right, we have these very strong individuals. And those are the core people that enable self service analytics. And so from that point on is where I focus, and Alex really does everything before.
But the thing that’s super important that Alex does is he knows how the data needs to be ingested, and kind of initially modeled for that analytics experience. And so if you don’t have a tight interaction and relationship between data engineering and analytics, you end up having people just dropping data into the data warehouse and not giving a crap about what it looks like or, honestly, the quality of it, or how it’s going to be used, and then it’s so inefficient for analysts to try to query it and model it. And you know, there goes your Snowflake costs, if all of a sudden, you know, instead of writing a few different scripts before you dump it in the warehouse, you’re just dumping it in there. And then next thing you know, you’re spending $1,000 more on Snowflake for the analysts to try to model it and create something of it. So I think, in general, that overlap there and empathy and understanding about what we want to do with the data has really allowed us to grow in a scalable way.

Kostas Pardalis 42:57

Yeah, that’s great. Some great points here. So based on your experience with Big Time Data, and also your prior experience, what are some common issues that you see with your customers in the communication between data engineering and data analysts? And do you have some advice to give around that? And if you would, can you also tell us how Big Time Data helps with that? Because solving the technology problem is one thing, but the technology can do nothing if the organization is not right around the technology. So what are your thoughts around this? And how do you approach it as Big Time Data?

Alex Dovenmuehle 43:34

Yeah, so the clients that we’ve had, it’s been interesting, because most everybody is like, Hey, I know we need to have a data warehouse, so let’s just use BigQuery, or Redshift, or whatever. And they’ll have some data in there, but then they’re sort of like, okay, we have a data warehouse, great. But it’s, well, hold on a second there, you’re not really getting any value from this data, really, or you’re just running one-off queries on top of it. So it ends up becoming this thing where we come in, and we’re like, Okay, cool. You have some data, what are you doing with it? And they’re like, well, we’re trying to figure that out. And that’s what we’ve really been helping with is, hey, let’s get in there, let’s understand your data. Let’s model it, and then let’s build a scalable analytics infrastructure on top of that. And then you know, then you can get into even more fancy stuff, as far as, you know, marketing automation, and all that kind of thing. So, I mean, that’s pretty much what we’ve seen a lot. And, you know, like you said, it’s an organizational change as well, because one thing that we’re really sensitive to is just the trust in the data that people need to have to use the data that you’re producing. Because, you know, it’s, Okay, great, you have all these fancy graphs and stuff. But do people actually use that data? Are people actually trusting that data? And so that’s something we’re really sensitive about, is making sure that if we come into an organization, we’re not trying to just build something, leave, and then nobody really uses it. We really want it to be used long term and build sort of that muscle within the organization on being data-driven. And, you know, like Rachel was talking about earlier with Aneal at Mattermost, you know, they’re trusting all this data, they’re using it for board meetings, and all that kind of stuff.

Rachel Bradley-Haas 45:25

Yeah, I think the one last thing I’d add to that is, I think a lot of these companies, you know, obviously, it makes sense, when you have a limited number of individuals, you have a lot of people focused on the product in engineering, and then you have kind of this slim, quickly moving, go-to-market area, right, you’ve got a salesperson, a marketing person, they might have other roles in the company as well, right, especially when you’re really small. And so they don’t necessarily have the ability to hire a data engineer. And what ends up happening is no one’s taking a step back and saying, what does this data mean, how should it be used? You have product that’s saying, I know I should be creating a lot of data, I know that someone’s gonna want to use it, I don’t know what they care about, I’m just gonna create a ton of data and send it into a warehouse. And then you have people on the other side that maybe don’t have the technical skills, saying, I don’t know what to do with this data. I don’t know what it means. And so there’s this awkward gap, right. And so I think what ends up happening is that the gap will continue to grow. And it makes it very hard, once again, as we talked about, can we make this data accessible to the modern person, or the common person at a company. And so if you don’t add that layer that we’ve talked about, the one the analytics engineer works on, saying, I can take a step back and say, This is the raw data. This is how it maps to what a customer is doing in our product. And this is what you should care about from a business perspective, then that gap continues to grow.

Rachel Bradley-Haas 46:56

And so I think that’s really what we’ve seen is people trying to make sense of the data, but really not knowing where to start. And so while it’s not always the first thing that you hire at a company, I do think it’s something that you should start moving up a little bit further, when you hire an analytics engineer, you know, get that person in there, build that business logic sooner rather than later, or else you might suffer the consequences.

Kostas Pardalis 47:22

That’s great, I have many more questions, especially around your experience with Big Time Data. But I think we will need at least another episode. Which is good, I’d love to have you back. But now I have to give the stage to Eric, because he also has questions. And I think, Eric, I really monopolized the conversation here.

Eric Dodds 47:41

No, that’s great. Unfortunately, we’re coming up on time here. So I’ll just throw one more question out there to wrap it up. And of course, we would love to have you back on the show. So many great insights. And we’ve talked some about tools, but just give us a breakdown–and maybe we can divide into sort of three stages of companies, just thinking about our listeners who are probably at all different stages of companies–but give us a quick breakdown of what is the Big Time Data stack recommendation for maybe a sort of seed stage or, you know, seed stage, series A startup, to a sort of mature, maybe mid-sized company, you know, maybe 100 plus employees dealing with some serious data, multiple thousands of customers, to a gigantic enterprise, you know, like Heroku that’s running, you know, maybe inside of Salesforce. What’s the ideal stack? And maybe we can approach it from the standpoint of what tools are the same across all three? And then what tools are different, you know, for each stage?

Alex Dovenmuehle 48:54

Yeah, so I think across all stages, of course, it’s going to be DBT. Because, you know, that’s why it’s such a good tool to invest in early, because it’s gonna pay dividends for years, you know, working with it. And also, if you want to use DBT Cloud, it’s super cheap. So it’s not like you’re, you know, paying through the nose for it. I think that’s one thing.

Alex Dovenmuehle 49:20

I think, you know, I’m not as dogmatic about which data warehouse you pick. I know Rachel would have a different answer. And in some cases, depending on your size, if you’re really small, I don’t even know if you need a Snowflake or a BigQuery. If your data is small enough, and by small enough, I mean, maybe a terabyte total, which is actually, you know, a pretty decent amount of data, you know, you could just run a Postgres database, who cares. And then obviously once you get to that sort of growth stage and you’re a little bigger and you can pay the money, you know, go with, I would say Snowflake would be my 1A and then BigQuery could be my 1B if you’re all up in the GCP world. And, you know, I think, then what you’re going to need is, you know, the ETL tool, just, you know, a Stitch or Fivetran or whatever. Stitch is pretty cheap too. So you can get away with it that way. And then you need to get product data into your data warehouse. So you know, a Segment or RudderStack, a tool like that. And then you’re going to need at some point your reverse ETL, which I would say is more a growth stage tool. And there’s so many tools out there for reverse ETL that Rachel and I are still trying to figure out which one is the best. But there’s so many players in that space at the moment. Rachel, do you want to jump in?

Rachel Bradley-Haas 50:52

Yeah, I was just gonna say I feel like one of the things is across the board, kind of like you said. From my perspective, it’d be, you know, series A, seed round, depending on your type of business, you probably don’t need a Snowflake or BigQuery, probably just Postgres, Redshift, something like that. But then we start talking about the tools that are across all of them, definitely DBT. The other one, I mean, in my opinion, and I know this is the podcast, but RudderStack. The reason I would pick RudderStack for event streaming is really because that’s going to scale with you. So we talked about how much energy or effort it is going to take for you to move from one product to the next as you grow. With RudderStack, I definitely just feel like it will grow with you as you scale. Pricewise, you’re not going to be put in a corner as you start sending more events. So I do feel strongly about that one, as well.

Rachel Bradley-Haas 51:43

And then, the other thing is, I don’t know how much you really need a reverse ETL if you’re that small, because you basically have one salesperson that’s manually entering leads, and doing that stuff, right. So at that point, I think very early on, it’s maybe not necessary. But then as soon as you start having two to three different people, you have a third-party tool like HubSpot or Salesforce, and you’re really wanting to make sure that there’s enriched data, based on the product usage, in there. That’s when you should start really investing in it. You’ve got people like Census, Polytomic, I know RudderStack’s coming out with their new stuff. I think overall, that’s a new space that we’re going to see a lot of growth in, in terms of how do you make this data accessible in the places where people need it most, which is sales and marketing, and all of those things.

Rachel Bradley-Haas 52:30

I’m trying to think about if there’s other stuff that’s really missing there. I guess the last thing is data visualization. You know, Looker is not the cheapest product, I think it really helps you when you’re dealing with data governance, but when you’re really small, you could probably let DBT handle a lot of that. So I’d say when you hit series B, I would start thinking Looker. Before that you could probably deal with, you know, Metabase, I think is open source. They also have a cloud version. What are the other ones Alex, that you think from a visualization standpoint? I guess there’s Mode.

Alex Dovenmuehle 53:05

Yeah, there’s Mode. The thing I don’t like about Mode is just, you’re having to put so much SQL into it, which again, to your point about DBT, it’s, if you can basically make your Mode queries really, really simple, and then have all the complexity in DBT, then I think that allows you to scale. Plus, then if you do switch to Looker, you’re already kind of … you already have all these DBT models that are being used. And you can basically just … it makes your migration process a lot easier.

Rachel Bradley-Haas 53:35

Yeah. And when we say data governance, I think the biggest thing that we’re talking about from a visualization standpoint is there are tools where you write one-off SQL for every single visualization you want to create. And what ends up happening is, like we mentioned, technical debt, because if a single piece of business data changes, and you have to go and update your visualizations, are you going to want to go and update 1,000 visualizations to add that one piece of logic because you decided to write custom SQL for every single thing? Versus with Looker, you have your own code behind the scenes, which is called LookML, where you define all your business logic, and then it just flows into that visualization. So it ends up being much more scalable. And that’s what, you know, we’re talking about in terms of data governance, where it’s so much easier to scale and you can really trust it, because all of the logic is owned behind the scenes in a GitHub repository. And you make one minor change, it has to be PR approved, analytics owns it, so you can really trust that data as well.
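
The contrast Rachel draws, one shared definition of business logic versus custom SQL repeated in every chart, can be illustrated with a small Python sketch. This is not LookML; the "active customer" rule, the field names, and the two report functions are hypothetical and only show the principle that a rule defined once flows into everything built on it.

```python
def is_active_customer(account):
    """The single, shared definition of an 'active customer' (hypothetical rule).

    If the business changes this rule, it changes here and nowhere else.
    """
    return account["plan"] != "free" and account["events_last_30d"] >= 10


def active_customers_by_region(accounts):
    """First 'visualization': count of active customers per region."""
    counts = {}
    for account in accounts:
        if is_active_customer(account):
            counts[account["region"]] = counts.get(account["region"], 0) + 1
    return counts


def active_customer_rate(accounts):
    """Second 'visualization': share of all accounts that are active."""
    if not accounts:
        return 0.0
    return sum(is_active_customer(a) for a in accounts) / len(accounts)


# Hypothetical account data.
accounts = [
    {"region": "EMEA", "plan": "pro", "events_last_30d": 42},
    {"region": "EMEA", "plan": "free", "events_last_30d": 90},
    {"region": "AMER", "plan": "team", "events_last_30d": 12},
    {"region": "AMER", "plan": "pro", "events_last_30d": 3},
]

print(active_customers_by_region(accounts))
print(active_customer_rate(accounts))
```

Both reports pick up a change to `is_active_customer` automatically, and because that definition lives in version control, the change can go through review before anyone sees a different number.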

Kostas Pardalis 54:40

Yeah, that’s great. Those are some great points. And actually my feeling is, because I’ve been following the BI visualization space for quite a while, also because back when I was at Blendo, we had to partner with them a lot, I mean, this market pretty much consolidated at some point about two years ago, and you could see, that’s also when Looker was acquired, right. And my feeling is that as we are entering a new, let’s say, innovation cycle in the data space, and the way we interact with the data or the requirements that we have around the data are going to change, I think we will start seeing new BI or visualization tools that are going to address that. And that’s something that I’m really looking forward to, seeing what’s going to happen in this space in the next couple of years. And the other thing that I would like to add, based on what you said, just to summarize those thoughts, my feeling is that right now we are in a period in time where there is crazy hype around anything that has to do with data. There are literally products coming out every day in every possible function around data, from governance to pipelines, and there is also a big part that has to do with ML and AI, which we haven’t touched on, since it’s still quite immature. But there are even new categories that are being formed right now, with things like feature stores, for example.

Kostas Pardalis 56:07

So there are way too many things happening. And I think for someone who’s trying to build a new stack, it’s really easy to get lost in all these details, make the wrong choices, have overconfidence in what you can do with your data. I mean, I was in this position, right? Like, I had five customers, and I was trying to do data driven products development, which okay doesn’t make sense. So that’s why I think that it’s a great opportunity. And I would advise all these companies, especially on the earliest stage, to get in contact with you at Big Time Data, because there are many pitfalls and a lot of advice that you can give to navigate the space and help them get value out of the data faster and reduce, of course, their costs, because when it comes to data products, mistakes cost a lot.

Kostas Pardalis 56:58

So Rachel and Alex, thank you so much for being with us today. Pretty sure, we will have another show for sure. I mean, there are many, many more things that we have to chat about, more business oriented things, but also more technical things. I think that one hour is just not enough. I mean, you both have so much experience in this space, that there’s so much value that we’re going to give to our audience. So I’m looking forward to having another show with you in a couple of months.

Alex Dovenmuehle 57:26

Yeah, absolutely. We appreciate you having us on. And yeah, I mean, like you said, there’s so much going on in the space. It’s exciting. And I don’t think Rachel and I realized, when we started Big Time Data, how much fun we’d be having just being a part of this community as it grows, and just learning all that stuff.

Rachel Bradley-Haas 57:46

I mean, that’s why we started Big Time Data, because we were having these conversations. And I just remember thinking, Oh my gosh, the conversations I have, you know, two or three times a week are the highlight of my week, I love talking about this stuff. And so it was just kind of surprising how much we knew and how much fun we were having. It was like, what are we doing, let’s just make this our life. So it’s been very exciting. And honestly, it’s a joy to come and be able to have these conversations, you know, we have these conversations, Kostas, off of the podcast all the time with you. So it’s been very fun to just be able to dive into this. The last thing I would add is, Kostas, you talked about how there’s so many tools that are coming out, right. And I think one of the things that Alex and I are really gonna try to hone in on is, what are those core components that you absolutely need? And then what are those fun little add-ons to your data stack that, depending on what you’re trying to do, would help you. And so, you know, I think that’s something that we could talk about in a future podcast. It’s, what are the core pieces? And what are some different add-ons that you should start thinking through depending on what you want to do. And if you have someone that wants to do AI and all these different things, it’s just really fun to think about it. But there’s so many tools out there, it’s really hard to know where to get started.

Kostas Pardalis 58:59

Yeah, absolutely. I think that’s an excellent idea, actually, for content in general, but also for another episode to have together, how we can compile this landscape in a way that can be easily digested by our audience. And also give them some kind of let’s say map to navigate this and make the right choices and have the right tools depending on their needs and the market they are in and their use cases. So I think we should absolutely do that. And I have to say that I’m really happy to hear that you’re having fun doing all this because I think that that’s the best that can happen in life, right, having fun while delivering a lot of value to many people and companies. So that’s great, guys.

Rachel Bradley-Haas 59:39

Well, thank you again for having us on here. Hopefully, we’ll be back soon.

Kostas Pardalis 59:43

Absolutely. Thank you so much.

Eric Dodds 59:47

As always a fascinating conversation with Alex and now Rachel. I think the big takeaway that I had was really just reinforcement of an idea that we’ve heard before on the show. And that is that the tooling is one thing. And it sounds like it’s just gotten way easier to build a scalable stack. But the people running the stack really make the difference. And it’s their commitment to shepherding the data and shepherding the tools in a way that doesn’t create future problems for the organization, which just aligns with sort of what we want to learn about on the show, right, the people who are behind the tools.

Kostas Pardalis 1:00:23

I think this was a very unique show exactly because we had the opportunity to have two people that have a very symbiotic relationship, who have the data engineering on one side and the operations on the other side. And I think it became extremely clear that the success of any kind of data initiative inside the company relies greatly on how these people in these functions kind of work together. And of course, with Rachel and Alex, they work really, really well together. But I think it’s something that whoever starts trying to build a data stack needs to have in their minds, together with the technology.

Eric Dodds 1:00:58

Absolutely. Plus, they’re pretty funny, and it’s great to have funny people on the show. Alright, well, thanks again for joining us on The Data Stack Show. Subscribe on your favorite podcast network to get notified of new shows and we’ll catch you next time.

Eric Dodds 1:01:12

The Data Stack Show is brought to you by RudderStack, the complete customer data pipeline solution. Learn more at RudderStack.com
