Episode 124:

Pragmatism About Data Stacks with Pedram Navid of West Marin Data

February 1, 2023

This week on The Data Stack Show, Eric and Kostas chat with Pedram Navid, Owner of West Marin Data and a frequent contributor on Substack. During the episode, Pedram discusses the modern data stack and its complexities, modern tooling, early-stage startups, and more.

Notes:

Highlights from this week’s conversation include:

  • Pedram’s journey into the world of data (4:05)
  • What should the data stack at an early-stage startup look like? (9:53)
  • New ideas surrounding access control for data (24:45)
  • What can data teams learn about complexity from software engineering? (30:55)
  • Scaling up instead of scaling out in processing data (37:40)
  • Why DuckDB is making so much noise in the market (41:06)
  • Final thoughts and takeaways (53:25)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 00:03
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.

Welcome to The Data Stack Show. If you have followed LinkedIn or Substack influencers in the data space, you’ve probably come across Pedram Navid. He is a really smart guy who has written some really helpful articles on lots of data-related things. I actually found his content researching several topics before meeting him. We got the chance to meet him and invited him to the show, and I’m super excited to chat with him because he started out in finance, in the financial world, with data, and then was at several startups in the Bay Area, most recently Hightouch. And now he’s running his own consultancy. So where am I going to start with my questions? That’s the difficult part. One thing that I do really want to dig into with him, which we haven’t talked a ton about on the show, is data stacks at early stage companies. We’ve talked with a lot of startup founders who have created startups, especially in the data space. Obviously, we’ve talked with a lot of data practitioners at various sizes of companies. But I don’t know if we’ve talked with many data practitioners who have done this at multiple very early stage startup companies in the SaaS space. And so I think that’s a really helpful thing to think through, for me and for a lot of our listeners: getting an opinion from someone who’s done this multiple times over about what you actually need at that stage as a company, in terms of your data stack. And then the other question I want to ask is, are you thinking about scale? Because generally, startups need to become hyper-growth, or at least that’s the plan. So those are my two big questions: what does the data stack look like, and then how do you think about building it in a way that can scale, you know, if you hit the jackpot?

Kostas Pardalis 02:20
Yeah. For me, I want to start with what we can learn from him. Like, what’s the difference between working in a very heavily regulated industry like finance, where he initially was working, and then going and working in a Series A company? That’s huge. And also, what is helpful to keep from working in a big and probably bureaucratic organization when you go and work in a tech environment, like a post-product-market-fit but pre-growth stage company, where things are chaotic? It will be awesome to hear from him what he found useful from his experience in doing that. That’s one thing. And the other thing is that Pedram is exposed to all the new things that are happening in this industry. I’d like to hear his take and opinion on some technologies, like DuckDB, for example, and this whole question of whether scaling up or scaling out wins, what we should do with infrastructure, and how we should process our data. So yeah, let’s,

Eric Dodds 03:54
Let’s talk with him. Let’s do it. Pedram, welcome to The Data Stack Show. It’s been a long time coming.

Pedram Navid 04:02
Thanks. Glad to be here.

Eric Dodds 04:05
We’ll start where we always do: give us your background, especially the parts about how you got into data in the first

Pedram Navid 04:13
place. So, it’s an old story now. I started at a bank a long, long time ago, and we had data coming in from a vendor through PowerPoint slides. It had two columns, one for this month and one for last month, and that was all the data we had. Every month, they would send us a new PowerPoint slide, replacing one column with the other. And so my boss asked, is there a way we can kind of figure out what’s going on, month by month, over a trend? So I would hand-copy this data from PowerPoint to Excel. One thing led to another, and I built a dashboard. Eventually I learned VBA, because I got tired of doing things manually. And that was really the gateway drug into the rest of my career: Python, R, data science. All that happened in the span of 12 years. Then I moved to the Bay Area, and I thought, enough of banking, let’s jump into startup life. I worked at a few different startups as a data scientist, eventually a data engineer, because I thought data science just took too long to get results. And one thing led to another, and most recently I ended up at Hightouch, as their head of data, doing data, marketing, and product.

Eric Dodds 05:44
So many questions. Okay. One thing from the early part of the story: it sounds like you sort of went through your learnings, you know, VBA through to Python, and then other subsequent languages and methodologies. Were you doing all that at a bank? And if so, were you sort of teaching yourself and bringing that technology into the bank? The reason I ask is, traditionally we think about banks as being resistant to technological change, especially if they’re getting data delivered in PowerPoint. So we’d love to hear a little bit more about that journey, and how you brought those technologies in. I mean, what was that like?

Pedram Navid 06:32
It was difficult, to say the least. So VBA was allowed, because Microsoft Excel was allowed, so you could get away with that. I learned VBA on my own, painfully slowly, I think as most people learn it; I doubt many people go to school for VBA. So that was just the beginning. And then, as I was searching, I found out about this thing called Python. I probably wasn’t supposed to download it to my bank laptop, but I did. And so that helped a little bit with the automation. And again, it was all really self-driven, self-taught, just trying to solve problems I didn’t want to do myself. It was purely motivated by laziness. And I think, to this day, that’s still the driving force behind what I do.

Eric Dodds 07:24
I love it.

Pedram Navid 07:26
As we moved towards things like R, to actually do real business modeling and analysis, that’s where I got the most resistance. We were doing compensation modeling for 12,000 employees in Microsoft Excel, and we were passing this one spreadsheet back and forth. On FTP? No, you know,

Pedram Navid 07:49
email, maybe SharePoint if you were lucky. So in my head, everyone’s changing these models, they’re dragging and dropping, and stuff is changing, and things are breaking, and no one knows, obviously, right? And six months go by, you roll out your compensation model, and you have to figure out why the numbers aren’t right. You go back and you find that some guy accidentally filled the wrong column in the spreadsheet. Or even worse, the executive changed their mind about the comp package, because they look at it every five minutes, and then you go back and update 50 VLOOKUPs and try to recalculate things. So I thought, there must be a better way. I learned about this thing called R; I was learning about data science on the side. And so I thought, what if I put all this logic into code instead of into a workbook, and try to automate some of this work? Our VP was very, very upset. He did not like it. He thought R was a black box. And I realized what he was mad about wasn’t us using R; he just wanted a spreadsheet. So I would do all the work, and then put it in a spreadsheet at the end of the day and give it to him. He could still have that, and everything was fine. So yeah, the work of appeasing your stakeholders never ends.

Eric Dodds 09:08
Yeah, yeah, that’s such a good insight. And it’s funny to hear the concept of R being a black box, because nothing could be further from the truth, but hey, perception is reality. That was super helpful. Okay, well, let’s fast forward to the move to the Bay Area. You were involved in multiple startups, most recently Hightouch, and did a bunch of data stuff at early stage startups. In our chat beforehand, you were saying sort of seed to Series A stage companies. And one thing I’m really interested in, that I’ve wanted to ask you for a while, is what your take is on what the data stack at an early stage startup should look like.

Eric Dodds 10:07
And, you know, there are a couple of motivating factors. One, I’m selfishly interested, because I’m involved with that every day. But it’s not something we’ve talked about on the show a ton. We’ve talked with people running startups, running data startups; we’ve talked with enterprises. But we haven’t really honed in on: okay, you’re a really early stage company, what does your data stack look like? And then, well, I’ll follow up with part B of the question. So you’re Series A, sort of late seed stage, and you’re running data at that company. What do you actually need?

Pedram Navid 10:46
And I can’t just say it depends, right?

Eric Dodds 10:51
Well, just explain that. Just explain what the dependencies are.

Pedram Navid 10:56
Yeah, let’s say, all right. My motivating factor, whenever I do things, is I need something that I don’t need to babysit. And I’m willing to trade off cost for engineering time, because I’m just one person. And again, I’m very lazy, but I’m also probably busy doing other things; I need something that just works. And I think in those early stage startups, your data’s usually not very big, right? And this might be blasphemy, but I might argue that your data is not that valuable when you’re first starting out.

Eric Dodds 11:37
Can you unpack that a little bit more? I agree with you, but I think that’s really helpful.

Pedram Navid 11:45
If we agree the goal of data is to help drive decisions: at an early stage company, you don’t have that much data, right? And you probably know every customer you have, and you probably know how you closed that deal and where it came from. So what are you really learning from a really complex data stack right now? You’re not building models, you’re not scoring leads, you’re not doing marketing attribution. At the end of the day, you’re maybe counting revenue and maybe a number of customers. That’s really the value when you’re first starting out. Now, it’s good to start with that stuff, because as those complex questions build over time, having a nice foundation makes them easier to answer. But unless data is your product, you probably don’t need to invest a ton in the early days.

Eric Dodds 12:50
That makes sense. One specific example of that I’ve experienced multiple times: things like multi-touch attribution are extremely powerful, but you actually have to have a pretty huge amount of data, and generally a lot of paid programs running, in order for a multi-touch attribution model to really be additive in terms of shifting marketing budget. And when you’re not spending a ton of money, you can spend a lot of time developing a model that might be accurate, but at the end of the day it’s like, well, okay, we’re gonna move ten grand from this bucket to this bucket; it’s not a huge deal. That’s super interesting. Okay. How about scale, though? Because in an ideal world, these early stage startups hit hyper-growth and scale really quickly, and when that happens, tons of stuff breaks across the company, which is just the way that things go, and people have to fix all sorts of stuff, from org charts to data stacks. What do you think about that aspect of it? Early on, you want something that just works; it’s a small team. Do the tools available scale? What do you think about that side of it?

Pedram Navid 14:16
That is a really good question. So let’s go through the whole stack. On the ingest side, there’s a few options out there: your Fivetrans, your Airbytes, and so on. And those, I mean, that scales as long as your wallets are deep, right? So that’s probably fine when you’re first starting out, because you don’t want to invest too heavily into that; it’s hard to, anyway. So that is something you can always take down the road and decide: do we want to keep using this, or should we build something custom to help reduce costs? You can pay to push that decision off, exactly, until it’s too painful, and then you deal with it. On the data warehouse side, you probably are not going to go wrong with Snowflake or BigQuery. You probably don’t need Databricks, I would assume, and I can’t see a good reason to use Redshift anymore. You’re probably fine; I mean, I doubt you’ll hit scaling limits with Snowflake. Again, BigQuery is a bit more questionable, but you’ve really got to be pushing numbers to be hitting problems there. And what else do you need? dbt for modeling, which, sure, you could probably hit limits there too. But if you’re at the scale where you’re pushing past what’s capable through that stack, then you’ve got really good problems: you will have a lot of data at that point, and you can just throw engineers at it. So I would welcome that issue. If the stack I built today doesn’t really scale? That’s great. Let’s hire.

Eric Dodds 15:59
Yep. 100%, 100%. I’m thinking about some of our large customers, and you have to be at a pretty big scale. I’m thinking about ones that have migrated off Redshift into almost fully data lake infrastructure, right? But you’re talking about unbelievable, unbelievable scale when you outpace basic warehouse stuff, which is super interesting.

Pedram Navid 16:36
You could probably get away with Postgres, if you really wanted, as the data warehouse, right? That you probably will hit limits on, so that’s where I think maybe just go with Snowflake and hope you don’t. But if you’re cost conscious, and you just want things to be cheap and simple, Postgres is pretty powerful.

Eric Dodds 16:58
Yeah, super interesting. Okay, other than the tools that you just mentioned, and then I’ll pass the mic over to Kostas, because of course the rhythm of the show is that I monopolize, then he does. What are the nice-to-haves for you? So I understand the core infrastructure: ingest, warehousing, a modeling layer. In the early stage, that’s all you need. But say you have a larger budget than you expected, so you’re going to do some quality-of-life additions. Do you have any preferences around things that you would add to that stack?

Pedram Navid 17:38
I don’t believe in quality of life for the data team. I just haven’t seen a tool that increases my quality of life enough to justify the expense. For me, it’s much more tactical, like planning for the future. So I’ve got my basic data stack; I’m probably going to need BI, right? So maybe those aren’t

Eric Dodds 18:00
That’s what I was gonna ask about, if you didn’t mention it.

Pedram Navid 18:04
You probably will need BI tomorrow. Maybe you start with Superset; it’s pretty cheap, free. Maybe you decide you need a semantic layer, because the demands on your team are growing, and then you move to something like Looker. They’re all valid places to be. There’s Metabase; there’s nothing wrong with any of those. I think those are all highly dependent on your team. I’d call that something you probably need at some point; it’s just, when is the right time? Product analytics is another one. So getting data from RudderStack into Amplitude, or any of the other ones out there: feature adoption, sort of understanding engagement and growth, all that kind of fun stuff. That’s usually driven by demand, not by what you just want to do for fun, right? So if your marketing team and your product teams are asking for this stuff, you have to find a solution. And the solution usually isn’t writing SQL queries for funnels, because nobody wants to or knows how to do that. Instead, you give them something self-serve. That’s kind of how I look at it. Everything else just seems, I don’t know, I need something motivating for me to go get it. Like, data quality is always one people talk about, and data catalogs, metadata. Those all seem nice to have, but I won’t go out and spend my marketing or my data dollars on them, not unless I have a pressing need.

Eric Dodds 19:34
Yeah. Would you throw orchestration tools into that bucket? I think about cataloging and orchestration. Again, we’re talking about early stage startups here; we’re not talking down these tools in general, because at scale, obviously, data teams are running all these things. But the cataloging piece and the orchestration piece I sort of see as a next level, where you have a growing data team and a level of complexity where those have a lot more appeal. In the early stages, they actually add more complexity, in some ways, than quality of life.

Pedram Navid 20:16
I mean, at the end of the day, how big is your data team, right? Do you really need a catalog when you’re the one building every table, cataloging everything? I mean, we can build a catalog and pretend that we’ll put it in front of all our stakeholders and they’ll go look at it. They never do. It’s never going to be a thing they do. A data catalog is for the data team, at the end of the day. And if I’m the data team, I don’t yet have the problems at scale that those tools tend to solve. In the early days, those just aren’t your problems.

Eric Dodds 20:47
Yeah, super interesting. Okay, actually, one more question in that same train of thought. Sorry, Kostas. Have you learned any lessons around when to introduce, or even how to introduce, tooling? Because I think you make a really interesting point on something like a cataloging tool, where you can take something that inherently, in and of itself, is very useful. It can be extremely useful to teams to drive data discovery, etc., especially at scale. But you can introduce it to stakeholders without context in a way that really paints the tool in a bad light. Or, you could even think about, in some cases, a tool like dbt, which feels ubiquitous to us in the industry, but can seem redundant to someone whose context is “we’ll just write SQL right in your warehouse.” Have you learned any lessons on when and how to introduce tooling in a way that drives wider adoption, if it’s something that you have a lot of conviction about? Not talking about the quality-of-life stuff, but something you have conviction about. I don’t know if that question makes sense.

Pedram Navid 22:09
The tooling that I tend to introduce is always driven by the demand of the day. And when I look at tools that are internal, no one cares about the tools I use internally. I mean, why would they? It’s like caring about which framework the engineering team uses: it doesn’t matter what the engineering team uses; that’s a concern for them. Most of the concerns for the data team are really the team’s concerns. No one cares if we’re using dbt, or Snowflake, or BigQuery. Those are your own sort of issues. I think where it becomes tricky is the stakeholder tooling. So your BI layer is really the interface between your team and other teams, and product analytics is similar: an interface between your team and other teams. Although I would argue cataloging is really most useful within data teams. So that’s really the way I look at it. And if it’s something that’s externally focused, the Amplitudes, the Lookers, the dashboards, then it’s definitely a mutual discussion: what are your needs, what types of workflows are you going to use, let’s run this POC together. It will never be me just making a decision for everybody. I want our stakeholders involved so that they have buy-in and can see the value of the decisions we’re making, because at the end of the day, they’ll be consuming this far more than I will. So let’s make sure that they do. And for the most part, that’s worked; they tend to love the tools that we pick together.

Eric Dodds 23:40
That’s great. Well said. Wonderful advice. All right, Kostas.

Kostas Pardalis 23:48
Thank you for giving me the microphone. So, Pedram, I have a question. It’s been, I don’t know, 5-10 years now that there has been some kind of explosion in terms of tooling, like innovation, or new products, or whatever, when it comes to working with data, right? We have the modern data stack; if you just take a map of the modern data stack, it has turned into so many products that

Pedram Navid 24:14
there would be a lot, right?

Kostas Pardalis 24:20
You hear about quality, about storage, about modeling, semantic layers, metrics layers, whatever. There is one thing, though, that I don’t hear about much, and maybe it’s my fault, but I’d love your thoughts on it, because you came from a very regulated industry, banking, right? And you moved into Series A companies, where things are much more scrappy when it comes to how we regulate access to data. So what’s going on with access control over the data? How do we control it? Who has access to what? How do we process it? What happens when someone comes in and says, oh, I have the right to be forgotten, or whatever? So what have you seen there? What’s your opinion? And is it my fault that I don’t hear that much about it?
It’s definitely not your fault. I would blame the marketers on this one again; they’re not doing a great enough job of educating you. There are two companies I know of in this space, so it is not very big. Immuta, I think, is one, and I just talked to one called Jedi today about this. And they’re both trying to approach this problem: access control, and visibility into who has access to what. And the problem is, there are just so many tools where you have to regulate access. If you think of it, you have your data in Snowflake and it goes into Looker. Just those two tools, that’s probably two completely different sets of ways of managing permissions. And it’s not enough to manage it just in Snowflake and hope the rest works, because of the way the BI tool works, you might have access to payments data in Looker that you wouldn’t expect. So getting that right, I think, is really hard. And I don’t think many startups are actually thinking about it or worried about it.
I think it’s pretty open in the early days, who has access to data. And people tend to lock things down not because of the regulatory side, but more because people aren’t using the data correctly, at least in my experience. I tend to default to having things open initially, and then that always backfires, because everybody’s going into the data, coming up with answers, and they’re always wrong, and they’re asking you to check their work for them. And you’re like, wait a minute, no one gets access like this anymore. That’s the access control story at startups. Banking is totally different, obviously. It’s regulated to an incredible degree. I think we had a typo on a field in a dashboard, and I requested it to be fixed, and it was a three-to-four-week estimate. It had to go through a different team, and be priced out in dollars, and come back and get approved, and all these layers of approvals, for a typo. So I never want to work in that environment again, but there’s probably something we could learn from it about how we manage permissions across the data stack, for sure.
Yeah, yeah, I think you made a very good point. It’s not just about the data only. It’s the overall resources around data that you have to govern somehow. And it’s not only security, or privacy. It’s also how easily things can turn into a mess.
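The drift Pedram describes, where permissions are managed separately in the warehouse and in the BI layer and quietly disagree, can be sketched as a simple per-dataset set difference. This is only a toy illustration: every team and dataset name here is invented, and real Snowflake and Looker permission models are far richer than flat sets.

```python
# Invented example: who can reach the "payments" dataset through each tool.
warehouse_grants = {"payments": {"finance", "data_eng"}}
bi_grants = {"payments": {"finance", "data_eng", "growth"}}  # BI layer added a team

def access_drift(dataset):
    """Teams reachable via the BI tool but never granted in the warehouse."""
    return bi_grants.get(dataset, set()) - warehouse_grants.get(dataset, set())

# "growth" can see payments data in the BI tool that the warehouse never granted.
print(access_drift("payments"))  # {'growth'}
```

Auditing this difference per dataset is roughly the visibility problem the access-control vendors mentioned above are trying to solve across many tools at once.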

Pedram Navid 28:18
Big time.

Kostas Pardalis 28:18
I’ve seen it when, for example, you have a big engineering team, and you give access to everyone on the Snowflake instance. The things that will happen there... Eric knows. Eric knows. I think one of the results of this policy was having a database named after him on Snowflake.

Eric Dodds 28:45
That bad boy is still in production, really. So Eric DB lives on. Eric DB. Eric DB will live on. I will give it up when RudderStack IPOs. Yes, Eric DB still runs production dashboards.

Pedram Navid 29:06
Wow. Yeah. Because, like I said, when you start having many people getting served from these resources, it’s not that easy to decommission them. It’s definitely ironic, and expensive, because not everyone knows what Snowflake charges can be when you’re doing small queries every five seconds. Oh, the data is small, how much could it cost? Well, about $20,000 over a year. So I think people will care about governance eventually, at some point. It’s just, how many times have you gotten burned before you do? I didn’t really care about governance at my first startup, but I certainly care about it now. I mean, it’s easy to see how things go wrong. People make mistakes, and no data team wants to be faced with another question about why two numbers don’t match, because this guy over there queried something and got what they thought was the right number, and it’s your job to go and unwind this 15-page query that they wrote to figure out why these two numbers are different.
That’s a very, very good point, and it brings me to my next question. So, okay: resource management in general, in a pretty complex environment, is not anything new in engineering, right? Just think about someone who is an SRE or DevOps at a medium-sized startup doing AWS; the complexity is crazy over there. That’s why we have things like Terraform and all these tools out there. So software engineering has been dealing with complexity for many years now, an immense complexity that’s part of the productization issue, not just complexity because the problem is complex at its root, as a science problem. There is a lot of discussion about bringing, let’s say, best practices from software engineering into the data space. A good example of that is dbt, right, and how it enables workflows and best practices from software engineering. Where do we stand?
With that? Do you think there’s more that data teams can learn from software engineering? Should data teams become like software engineering teams, and suffer the same things? Or is there some kind of space for new paradigms that are applicable only to data?
It is a really good question. Certainly dbt has helped. I remember the old days, and many data teams still do this, where your SQL query was saved in a text file on your desktop, and there was no version control; you just had to ask someone how they wrote something, and they would send it to you by email, right? So we’ve come a long way, I would say, especially on the data modeling, transformation side. A lot of the tools in the ecosystem are also moving towards that model: they’re building in things like version control, and declarative, YAML-style configuration for how you set these things up. I think that’s all great. But I do wonder if data teams themselves are sometimes missing the bigger picture of how these things work together. If I think back to the older data engineering types of people, they tended to come in through more technical backgrounds, right? They came in through computer science or software engineering, and they learned about all the trade-offs there were between, you know, performance and how data moves through systems, and what it means for data to use a cache, or to go to disk, or to go through the network, and what all those things meant for response times. That type of stuff, I think, most engineers kind of understand and know well. And then all the associated stuff that comes around it, like deploying Docker containers, Kubernetes, and all this. They learned this stuff because they had to, and I think that’s been really helpful. I do think there’s a lot of people coming up in data from outside of that, and maybe they haven’t had exposure to that side of the world.
I do see it sometimes biting us a little bit when we’re starting to move data into what is really a production setting, without some of that understanding that software engineers have built up over the years. So maybe our tooling is good, but I don’t think the conversation about how we think about moving that stuff around has really happened yet. What does it mean to query Snowflake, like, how does that actually work? And what does it mean to transfer data outside of regions, and what does that look like in terms of costs, that type of thing? So that type of stuff, we still need to maybe do a better job of. It’s still early days. But when you look at where we started, we’ve definitely come a long way.
And do you think it’s the tools that are missing? Or, let’s say, knowledge and best practices?
I think the tooling is actually pretty good these days. It’s not really best practices; it’s knowledge, and I think it’s learning from each other. We don’t tend to talk too much about this stuff, right? When I look at the talks people do in data, it’s always about the tooling itself, but rarely about how we move stuff into production, or how we think about different trade-offs in terms of performance characteristics. That type of complexity doesn’t come up much, versus some of the other types of talks we’re having right now.
Yeah, that’s an excellent point. Maybe we need better conferences, more collective knowledge?
I should be writing more about this stuff, too. I’m just as guilty as anyone else. It is happening; people are asking questions. Jacob Matson, for example, created the modern data stack in a box not too long ago, and that project took off. I worked with him to build Docker and Kubernetes into it. So if that’s something you want to learn more about, you should check out its GitHub repo; it has all that stuff in there. It’s still early days, but hopefully this is part of that conversation too, right?
Yeah, that’s great. You mentioned conferences. Do you have any favorite conferences out there, any conferences that you really got a lot of value from, not just from the networking part and all these things, but also from the content that was created and how it was delivered as part of the conference? On the data side, not a ton. I am really jealous of the software engineering conferences that I see out there; PyCon, for example, has always been really good. RStudio used to have a good conference a few years back, I think less so now. It’s become much more ecosystem and platform focused. I think all conferences kind of end up that way at some point if they’re run by a vendor, though maybe that’s just inevitable. Normconf, I have to give a shout-out to that one. It looks really good, it’s run by Vicki Boykis, and it’s coming up in a few weeks, actually. It’s free, it’s online, like an 18-hour slog, so definitely check that one out. A lot of good people are talking there. That’s cool. Well, some great resources. Cool. And okay, my next question is about something you mentioned when you were talking with Eric at the start, about what the data stack should look like for a new company and getting to the scaling part of it. There is, or at least it feels like there is, some kind of change in the mindset of people in the industry right now: instead of going and using systems that scale out, to try and build systems that scale up, right? And I think DuckDB is a very good example of that, right? Something that you can run locally that is going to fry your CPU, because it’s going to use every last register of the last core in there to process data. And people are interested in that. What’s your take on that? How do you feel about

Kostas Pardalis 38:36
this? Still trying to figure it out, I think, is my take. I really like DuckDB. I use it locally a lot. But to me, it’s like SQLite: a great tool for the right context. You rarely deploy an application using SQLite; you’ll probably move to Postgres or MySQL. But it can be great to have SQLite for your test cases, because it will run faster and you’re off of any infrastructure, and that’s fine. DuckDB, to me, feels like it’s either middleware within someone else’s application stack, or a great tool to use locally because you don’t want to move data out. That totally makes sense. But if your production data is in your cloud data warehouse, I don’t know how bringing it locally to your laptop is going to solve any of that. That’s a tough argument to make. I don’t know, but we’ll see. Yeah, I haven’t seen it myself, but that doesn’t mean it’s not out there. Okay, so how do you typically use it yourself? For example, whenever I need to do something quick with data, I prefer to use it locally, obviously, and I don’t want to load the data anywhere, that kind of stuff. And it’s blazing fast, right? You can do that with quite a lot of data, too; it performs pretty well on your laptop. But how do you use it? What are some interesting use cases for you? I use it the exact same way. I’m working on a little side project to do entity resolution, benchmarking different methods using it. So DuckDB is great for that, because I have a couple of files on my laptop, I want to read them in, I don’t want to set up Postgres and load them into a database; I can write some SQL and do some aggregation on top of it. That works pretty well. That’s really the only use case I have. But I’ve heard of other people doing more important things with it.
So I’ve heard of people using it as part of an ETL pipeline deployed to production, to speed up some type of transformation they’re doing. Yeah. And so, I mean, that kind of makes sense, right? It’s just another tool in your toolbox. Yeah. But for me, it’s really been, I guess, just local development and playing around, and not having to spin up more infrastructure to do anything. Yeah. Why do you think that it has created so much noise in the market? The reason I’m asking is because, recently, I downloaded ClickHouse and played around with it. And to be honest, ClickHouse doesn’t have that much of a different experience for working with local data, right? It’s a single binary, you download it, it has a lot of tooling, amazing support for importing data and querying the data, amazing performance too. You can do similar things with it as you do with DuckDB. But okay, ClickHouse has been known for different kinds of use cases. I’ve never heard anyone say, let me download it and do something local, right? So why did DuckDB manage to do that, and create this kind of perception in the industry? I have no idea, to be honest. And I’m always scared to speculate, because they’ll come after me. I don’t know. I mean, people love it, so they must be doing something right. It’s a genuinely useful tool; people use it, and companies are using it in their production applications as part of middleware, and that wholly makes sense to me. It’s nice having a way to read a bunch of CSV and Parquet files on your computer, which was traditionally a little bit harder to do than that. So I mean, it’s great. I don’t know why it became so popular and so loud. It just took the world by storm. I can’t speculate on why, but I’m happy for them. Okay.
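The local workflow Pedram describes, a couple of files on a laptop, some SQL, some aggregation, and no infrastructure to spin up, can be sketched in a few lines. This uses Python’s built-in sqlite3 as a stand-in for DuckDB, since both are in-process databases; the data and column names are invented for illustration:

```python
import csv
import io
import sqlite3

# Pretend this CSV is a file sitting on your laptop.
raw = io.StringIO("channel,revenue\nads,100\nemail,40\nads,60\n")

# In-process database: no server, no deployment, nothing to spin up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (channel TEXT, revenue INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(r["channel"], int(r["revenue"])) for r in csv.DictReader(raw)],
)

# The kind of quick aggregation you would otherwise ship off to a warehouse.
rows = conn.execute(
    "SELECT channel, SUM(revenue) FROM events GROUP BY channel ORDER BY channel"
).fetchall()
print(rows)  # [('ads', 160), ('email', 40)]
```

DuckDB goes a step further than this sketch by querying CSV and Parquet files directly with SQL, which is what makes it attractive for the read-a-bunch-of-local-files use case mentioned above.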
Which brings me to my last question before I give the mic back to Eric: marketing and content around these technologies, right? There’s a lot of education that needs to happen when you teach people how to use tools. And maybe, I don’t know, even with DuckDB, they probably did something right with the distribution of the technology, which always includes marketing somehow; maybe one day we’ll learn what the magic is there. But you’ve also worked at Hightouch, right? You were part of the team, and the product was new in the industry; reverse ETL was something new at that point. So based on your experience, what are some really good ways of reaching out to people out there and helping them understand the value of the tools and become better data engineers or data scientists or whatever, when they have to work with data? Yeah, I don’t know if it’s foolproof, but the way I always look at it is: where are the people who you think would benefit from your product? And then, if you truly believe that your product has value, how do you teach them about that value? At the end of the day, that’s all marketing is, I think, and when viewed from that lens, it makes it easier to think of what the possible steps are. So I can walk through how I thought about it at Hightouch. I knew what the product did: it helped move data, for example, from the warehouse to Salesforce; that was one very simple use case, right? And I knew who benefited from it: people like me who used to have to write this code manually, usually through a Python integration. And so, having a good understanding of what the value is and who it’s for, marketing becomes very easy. It’s, okay, well, if people like me would benefit from this, how do I reach them? Well, do they know what reverse ETL is? And in the early days, the answer was no. So we had to educate.
And so a lot of my work was spent educating people on what it is, what the value is, what it means, why it’s different from x, y, and z. Once I had a good understanding of what that was, the next question is, how do we make people aware of our company, right? And that’s a little bit harder. There’s no shortcut. It’s just constantly creating content, content that data people would find genuinely useful, to bring people to our website. And so I would just write about things I was curious about, which for the most part were things I had learned. I think those two things are great places to start. So I would create content on things like the difference between Airflow, Dagster, and Prefect, something I’d always wondered about. And if you go and Google it, you won’t find much; you’ll find marketing pieces that talk about it a little bit, but no one had actually tried all three and written about it. So what I did was, I downloaded all three and wrote about it. And that became a great source of traffic to our website, because it was the only thing that had covered all those things. And so that’s usually the way I think about it. How do I generate something useful for people? What do I have a unique perspective on that hasn’t been done before? If you can do that, then hopefully that brings people to your website. Yeah, makes total sense. Eric? Microphone is yours. Sure is.

Eric Dodds 47:14
I’m sorry, sir, I am so excited. Oh yeah, you can see that it excites me, Pedram. We’re both excited, at least I hope so. I’m really interested to know, Pedram: you are now consulting, which is relatively recent, and you came out of doing data and marketing at venture-backed companies, most recently a venture-backed data company, right? The vortex, the marketing vortex in the data world for venture-backed data vendors, is pretty intense. I mean, that’s what I live in every day. But now you’re consulting, right? So you have companies that bring you problems, and you need to figure out the best way

Pedram Navid 48:18
to solve them.

Eric Dodds 48:22
Have you had any changes of perspective going from the world of venture-backed data vendors to companies paying you to help them solve pretty specific problems?

Pedram Navid 48:40
I think I quickly realized how far ahead we all are of where most companies are. When I started to talk about the modern data stack, the number of companies out there that are actually implementing it, or that even know about it, is very small. The number of companies that know about dbt is actually quite small. You talk to most of these companies, and half the time they don’t even have data teams. Now, maybe that’s selection bias from the ones talking to me. But a lot of companies out there don’t have a data team. They have people who know what they want and have found ways to get it, for better or for worse, often for worse, which is, again, why they’re talking to me. So I think we have been in a bubble. I certainly have been in a bubble the past couple of years, and I think a lot of us vendors are kind of guilty of that: pushing a system that’s actually pretty complex onto people. That’s not to say it’s not useful or good; it’s the same one I will implement a lot of the time. But I think we often forget how far ahead we are and where we need to start the conversation with people. We probably can’t talk to people about the merits of data diffing within a data warehouse when they don’t even know that they need a data warehouse, right? So a lot of my work is really back to basics, trying to figure out things like, how do we teach people what this data stack is all about without confusing them? That’s already hard enough. And then probably the harder thing is to show them the actual value of doing all this work. Because if at the end of the day you put in all this work, and all they get is a report they were already getting before they started talking to you, that’s a problem. So hopefully you can say, well, what you were doing before served this need, but let’s talk about not just doing what you were doing before, but all the things that we can start to do.
Now that your data is centralized, we can bring in data from three or four different systems, we can start to be really nuanced about how we look at attribution, and we can look all the way down to your product level to see how different channels interact with each other, when people activate, or real revenue. That’s when I think people can start to see what’s actually possible with data. Where they are when they come to you is, hey, I need to know how many customers I have. And if you just stop the conversation there and give them that with a data warehouse, great, but why did they pay this much money for it, right? They could have kept doing what they were doing, without what you charged them. But if you can start to bring in more, that’s right, the whole point of this is to actually bring data in from different systems and start answering questions that you weren’t able to answer before, questions that are actually going to give you insight into your business. Then I think you can start to sell them on this idea. And that’s where most customers are. They’re nowhere near where we are today, where we’re talking about version control, data modeling, observability, and all this stuff. No one has any clue what any of that stuff means.

Eric Dodds 52:17
Okay, last question. And I would love for you to speak to our listeners who are in that bubble. And of course, with podcasting analytics, it’s really difficult to know how large this subset is.

Pedram Navid 52:30
Do you have millions of viewers?

Eric Dodds 52:35
Millions and millions? How do you break out of that bubble, right? If you are working in a context, and I’ll try to broaden it: if you’re working in a context where you’re sort of in the data echo chamber, and that’s your job day to day, how do you break out of the bubble? That’s a good question.

Pedram Navid 52:57
Get off Twitter, get off Slack, and go meet a real company? I don’t know. Yeah. Like, how do you talk to people who aren’t even in data? I think it’s a tough thing to do. Talk to people who aren’t in data, which you can do when you go outside; talk to people and ask them what questions they’re asking of data and how they’re solving the same problems that you’re solving. Because I bet even those people are doing it. Like, I’ve seen people do marketing attribution in Salesforce. I have no idea how it’s done, but I know it’s a pretty common thing people do, and, well, they don’t have a warehouse, so how are they doing this stuff? So the more you can talk to people outside of the data world, the better I think it will be for all

Eric Dodds 53:44
of us. Yeah. Such sage wisdom, Pedram. This has been a really wonderful show, it’s flown by, and we’d love to have you back sometime.

Pedram Navid 53:57
So great. Happy to come back anytime.

Eric Dodds 54:00
My takeaway, Kostas, which has been a recurring theme throughout the show, even from some of the very, very early episodes, is that generally keeping it simple is the best policy. And if you hear Pedram, who is probably more familiar than anyone with the most cutting-edge tooling in the data space, even stuff that very small startup companies are building, he picked a couple of core pieces of technology and said, this is what you need, and when you start to break it with scale, then you’ve hit the jackpot. You know, when we talk to practitioners, I just love how simple it is for them. They don’t use fancy terminology to describe technology. They just talk about the utility of the various things that are required of them in their job. And it really is pretty simple. Whereas some of the conversations we have about tooling can get really tricky, navigating all the marketing terminology, and I’m, of course, someone who is actively creating that problem. At the end of the day, I love the simplicity. Yeah,

Kostas Pardalis 55:31
I think Pedram has a very pragmatic approach to things, which is, first of all, super valuable for someone doing his job as a consultant, right? Because at the end, if you are a consultant, one of the biggest values that you can deliver to your customer is to go in and help them focus on what really matters for them and make the right choices. And it’s pretty difficult to avoid all the hype; there’s always, you know, a cheerleader for something. So, I don’t know, I really enjoyed the conversation with him, because it was very down to earth and very pragmatic, and he talked about the real problems. And I think he should be writing more and communicating in this style about what’s going on in the industry, because it’s super useful, and it’s missing. I think we need more voices like his.

Eric Dodds 56:48
I agree. All right. Well, thanks for tuning in. Subscribe if you haven’t, tell a friend, and we will catch you on the next one. We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.