Episode 76:

Why a Data Team Should Limit Its Own Superpowers with Sean Halliburton of CNN

February 23, 2022

This week on The Data Stack Show, Eric and Kostas chat with Sean Halliburton, a staff software engineer at CNN. During the episode, Sean talks about how access to data creates more appetite for data, building a robust SDK, how to approach the desire for self-service options, and more.

Notes:

Highlights from this week’s conversation include:

 

  • Sean’s career journey (3:27)
  • Optimization and localized testing results (7:49)
  • Denying potential access to more data (13:46)
  • Other dimensions data has (18:32)
  • The other side of capturing events (20:55)
  • Data equivalent of API contracts (25:03)
  • SDK restrictiveness for developers (27:40)
  • How to know if you’re still sending the right data (30:38)
  • Debugging that starts in a client of a mobile app (36:08)
  • Communicating about data (38:36)
  • The next phase of tooling (41:49)
  • Advice for aspiring managers (45:21) 

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com. 

Transcription:

Eric Dodds 0:06
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.

Welcome to The Data Stack Show. We have Sean Halliburton on the show today, and he has a fascinating career. Number one, he started as a front-end engineer, which I think is an interesting career path into data engineering, and he is doing some really cool stuff. Kostas, there are two big topics on my mind, and I'm just gonna go ahead and warn you, I'm gonna monopolize the first part of the conversation because I'm so interested in these two topics. So Sean did a ton of work at a huge retailer on testing, so testing and optimization. And in my experience, there's data pain all over testing, because testing tools create silos, etc. He ran programs at a massive scale, so I want to hear how he dealt with that, because my guess is that he did. The second one, I guess we're only supposed to pick one, but I'll break the rules, is on clickstream data. He also managed infrastructure around clickstream data and made this migration to real-time, which is also fascinating, and something that we don't talk a whole lot about on the show. So I just can't wait to hear what he did and how he did it.

Kostas Pardalis 1:39
100%. He's a person that has worked for a long time in this space, and he has experienced engineering from many different sides. He hasn't always been tasked with data engineering; he has been, as you said, a front-end engineer, and he ended up at some point, through different things, becoming a data engineer. So I want to understand and learn from him: how do you deal with the everyday reality of maintaining infrastructure around data? How do you figure out when things go wrong, and what does that mean? How do you build, let's say, the right intuition to figure out when we should act immediately and when not, and most importantly, how do you communicate that among different stakeholders, not just the engineering team, but everyone else who's involved in the company? Because at the end, data engineering is about the customers of the data engineer, right? You deliver something, which is data that someone else needs to do their job. So I'd love to hear from him, especially because he has been in such big organizations, what kind of issues he has experienced, and get some advice from him.

Eric Dodds 2:57
Absolutely. Well, let’s dig in and talk with Sean.

Sean, welcome to The Data Stack Show. We're excited to chat about all sorts of things, in particular clickstream data and real-time stuff, so welcome.

Sean Halliburton 3:10
Thank you. I’m super stoked to be here.

Eric Dodds 3:13
You have a super interesting history and background as an actual engineer, a software engineer. Could you just give us the abbreviated history of where you started and what you're doing today?

Sean Halliburton 3:25
Yeah. So I come at this gig from a lot of different directions. I was actually an English major in college. Before that, I was a music major. That's one of my favorite things about what we do: it takes a lot of different disciplines, and those disciplines come in handy at a lot of different times. I've also been an individual contributor, and I've been an engineering manager. And along with that, I've worn different hats doing program management and product management at different times, as the needs have been there. So I started out in the front end. And there's another angle that I think is unusual in this field: I'm also self-taught. When I first started 15-plus years ago, it was really easy to just dig into the front end by building your own site, spitting out some static HTML, and then slowly enhancing it with progressive JavaScript. And then some of the site templating engines and WordPress were starting to come in, and more and more people started tweaking their MySpace profiles and things like that. So I learned how to build data-driven websites and started specializing professionally in data-driven lead generation and optimizing landing page flows. I worked with the University of Phoenix's optimization team for several years and really learned a lot about form flows, and not only optimizing those pages to try to best reach the user and keep them engaged to convert and get more information, but also to optimize the data that came out of them that would go into the backend and power so many things behind the scenes. From there, I served about six to seven years at Nordstrom as both an IC and an engineering manager, really built out a program around optimization, and then expanded into clickstream data engineering. Over time, I got addicted to replacing expensive enterprise third-party SaaS solutions with open source-based solutions deployed to the cloud, which was still relatively new in the space at the time. And that's kind of where I'm at today with CNN as a staff data engineer. We've worked with a number of tools, some we love, some we thought could be better, and where we see opportunities to improve using open source tools, we have a highly capable team to do that. But interestingly enough, over the last two to three years, the pace of the greater community has been such, and some of the key tools, like commercial databases, have improved so much, that I've come back around a little bit and embraced SaaS tools where it makes sense to, for things like reverse ETL, analytics, and data quality, basically post-data warehouse.

Eric Dodds 6:36
Interesting. Okay, Kostas, I know you have a ton of questions about that, but let's come back to that. I'd actually like to tackle the three things you mentioned chronologically, so we'll do that one last. First, you talked about optimization. As we were prepping for the show, you mentioned you did years of work in testing and optimization in retail, which is sort of the tip of the spear in terms of testing and optimization, and getting it right can mean moving something a percentage point, which can mean huge amounts of revenue. But you come at it from the data side as well. And at least in my experience with testing, there's this challenge of the localized testing results living in whatever testing tool you're using, right? So you get a statistically significant result that this landing page is better, or this button is better, or this layout, or all that sort of stuff, which is great, because the math on multivariate testing is pretty complex. But it's hard to tie that data to the rest of the business. Did you experience that?

Sean Halliburton 7:47
Yeah. So I have a saying that people drive software, and software drives people. The tools you use have to meet the state of your program at the time and, conversely, are influenced by it. When it comes to optimization, everyone starts with the basics: testing different headlines, different banners, maybe different fonts. Then you kind of mature into running a handful of tests per month; you get a little more experienced, more savvy, more strategic, maybe you level up to a better testing platform and hire more analysts that can handle the output. And now you're running maybe a couple of dozen tests per month and testing custom flows and things like that. But there's still a limit as long as you're using a dedicated optimization platform. That certainly was the case for us at Nordstrom. We would generate analyses out of Maxymiser, which we were using at the time. But those analyses were reporting things like sales conversions in potentially different methods from our enterprise clickstream solution, which was IBM Coremetrics at the time, based off of two completely different databases, both of which were black boxes. Of course, a vendor can only convey so much about the logic that they're running in their own ETL on the back end. And as the technical team around these practices itself matures, it becomes more and more difficult to explain some of those results. At the same time, the more testing you do, the more data you naturally want to capture around those tests. So what your analysts want to know, their questions, increasingly overlaps with what's being asked by your product owners that are analyzing your wider clickstream data. So I don't think it was any coincidence that we began to look for alternatives to both of these solutions, and we landed on a couple of open-source options. One was PlanOut, which was a static Facebook library at the time, and we developed that into a service designed to be hosted internally on AWS and scale up to meet hundreds of thousands of requests at a time. And on the clickstream side, we planned and designed to scale up to handle more experiment results directly in the clickstream pipeline, and we migrated from Coremetrics to Snowplow. We leveraged the open-source versions of each one and put a lot of work into making them more robust and scalable. And over a couple of years, those two practices, I would say, really did become one.

Eric Dodds 10:45
So what I'm hearing is you essentially eschewed the third-party SaaS testing infrastructure and clickstream data infrastructure and said, we want it all on our own infrastructure. So then you had access to all the data, right? So for analysts and results, it's like, we can do whatever we want.

Sean Halliburton 11:06
Yeah, so this was in the early teens, and AWS itself was really still kind of in that early explosive phase where more highly capable and agile engineering teams were clamoring to get into the cloud. I mean, just the difference between working with our legacy Teradata databases on-prem and spinning up a Redshift cluster. I didn't need to ask anyone to spin up that Redshift cluster, I didn't need to ask anyone to resize it or anything. My clickstream team was able to tag events in the front end. Ironically, we took some clickstream events using our Maxymiser optimization JavaScript injection engine. And we could land the results into our own data warehouse, into our own real-time pipeline, within hours. We hacked away at this over a weekend and came back the next weekend, and we were so energized and really relieved, because the right tools can have that kind of impact on your quality of life. It became equally important over time to engineer the limits around those capabilities, though, as well. So that was one of the more interesting learnings that we had: the more power you find you have, suddenly the challenge becomes when to say no to things, and when to put artificial limits on those powers. Yes, we have access to real-time data now. But here's the thing: if we copy that data out in 10 parallel streams, we could have 10 copies of the data. If we produce a real-time dashboard of this metric or that metric, we have to make sure that that metric aligns with the definition of other metrics that a product owner might have access to that we don't even know about going in.

Eric Dodds 13:08
Could you give maybe one specific example? And you did a little bit, but stepping back just a little: access to data creates more appetite for data, right? It's kind of like you get a report, and then it creates additional questions, which creates additional reports. But could you give a specific example, if you can, of a situation where it was like, oh, this is cool, we should do this, but then the conclusion was, well, just because we can doesn't mean we should?

Sean Halliburton 13:41
Yes, absolutely. So again, to put a time frame around this, I would say this was 2016, 2017, 2018. We had a pretty large-scale Lambda architecture between our batch data ETL side, which was our primary reporting engine, but we also had the real-time side, as I briefly described, and that pivoted exclusively on Kinesis. Well, Kinesis is an outstanding service. It really is. I love it. It's similarly easy to provision a stream. It's like managed Kafka. I'm sure it's not exactly Kafka under the hood, but the principles are the same, and it's almost too easy to get up and running with. It's also easy to begin overpowering yourself with. We started landing so much data that scaling up a stream became, to put it nicely, excessive overhead. It could take hours; it was an operation that should not be done in an emergency scaling situation. And it kind of relates back to one of the fundamental principles of data that I don't think we talk enough about, really, and that's that data has a definite shape. You could describe it in 2D terms or even 3D terms. For the purposes of this example, I would describe it in 2D terms and just say it's easy to consider the volume of events that you're taking in; it's easy to describe those in terms of requests per second, or events per second, flowing through your pipe. It's easy to forget that those events have their own depth to them, or their own width, however you want to describe it. The more you try to shove into that payload, the more you can create exponential effects for yourself downstream that are easy to overlook. In our case at Nordstrom, we made a fundamental shift at one point to basically go from a base unit of pageviews down to a base unit of content impressions. So think of it as going from molecular to atomic. That's essentially what we did, and we took in a flood of new data into the pipe that we didn't have before in a short amount of time. Also remember, Kinesis only recently developed auto-scaling capabilities, so solutions to that scaling problem were really homegrown until very recently. So I think that's a classic example of be careful what you wish for, and know that you have some very powerful weapons at your disposal. Just stop to think about, as you said: okay, we can do it, but should we? What is the value of all that additional data? I would suggest to not only engineering managers but also product managers: be very deliberate about the value you anticipate getting out of that additional data, because it costs money, whether it's in storage, in compute, or in transit in between.
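
[Editor's note: a quick back-of-the-envelope sketch of the volume-versus-depth point above, with entirely hypothetical numbers that do not come from the episode. The throughput a stream has to absorb is events per second multiplied by payload size, so a shift from pageview-level to impression-level events can multiply it many times over.]

```typescript
// Editor's sketch; every figure here is an assumption for illustration only.
const pageviewsPerSecond = 5_000;        // assumed traffic level
const impressionsPerPageview = 40;       // assumed fan-out per pageview
const avgPayloadBytes = 2_048;           // assumed enriched event size

// "Volume" times "depth" is what the pipeline actually has to carry.
const pageviewThroughput = pageviewsPerSecond * avgPayloadBytes;
const impressionThroughput =
  pageviewsPerSecond * impressionsPerPageview * avgPayloadBytes;

console.log(`pageview-level:   ${(pageviewThroughput / 1e6).toFixed(1)} MB/s`);
console.log(`impression-level: ${(impressionThroughput / 1e6).toFixed(1)} MB/s`);
```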

Kostas Pardalis 17:26
That's a very interesting point, Sean. And I think one of the more difficult things to do, and something that many people don't do at all, is to sit down and consider when more data doesn't mean more signal but actually adds noise. And that's something that I don't think we discuss enough yet. Maybe we will now that people are more into stuff like quality and metric repositories and trying to optimize these parts as well. But you're talking about dimensions, and from what I understood, the increase of dimensions has an impact on the volume, right? And you talked about scalability issues, and how to auto-scale and all these things. What other dimensions does data have? And what other parameters of the data platform do they affect, outside of the volume and the scalability of it?

Sean Halliburton 18:30
Sure, sure. So I think a good example is kind of where we started, with this discussion of layering experiment data onto clickstream data. Or it may be a case where a product manager wants a custom context around, say, a mobile app that loads a web view, and suddenly you're crossing in between platforms, but the product manager wants to know what's happening in each context. So you may have a variant-type field, you may have a JSON blob embedded into your larger event payload that quickly bloats in size. Or here's another example: in an attempt to simplify instrumentation at Nordstrom, we attempted to capture the entire React state through the clickstream pipeline, so we could have as much data as we could possibly use, which was super powerful, but again could be a blocker at times. So when I'm debugging in the front end, I tend not to use an extension such as the Snowplow debugger that gives you sort of a clean text representation. I tend to look at the actual encoded JSON payloads as they normally flow through the browser to the collector, and try to keep my fingers on the raw data so that I don't forget what's being sent through, and ask from time to time, as part of your debugging routine: what is the value of this data? Okay, you want to capture hover events. How much intent do you expect to get out of a hover event? How will you be able to tell what's coincidental versus what is purely intentional?

Kostas Pardalis 20:30
That's a great point. From your experience, because as you describe this, I cannot stop thinking of how it feels to be on the other side, where you have to consume this data, you have to do analytics, you have to maintain the schema of your data warehouse, and all these things: how much is this part of the data stack affected by the decisions that happen, let's say, on the instrumentation side? Because at the end, okay, adding another line of code there to capture another event, it's not bad, right? It doesn't cost much. But what's the impact that it has on the other side? How much more difficult does it make working with the data?

Sean Halliburton 21:12
That's definitely a big piece of the puzzle and a big challenge. And that's kind of where you've veered into API design, right? You spend enough time in software engineering and you realize the challenges of API design. It's tricky. It's tricky to get the contract right in such a way that you can adapt it later without forcing constant breaking changes, because those breaking changes will not only break your partners upstream, but they'll break your pipeline. And if they break your pipeline, you've broken your consumers downstream. I've always worked at places that were wonderfully democratic, but by the same token, you end up being your own evangelist, because you are constantly pitching your product internally for consumers to use. They don't have to use your product. Any VP, any director can go out and purchase their own solution if they really want to, generally speaking. There are always exceptions, of course. And none of that is to denigrate any way of doing it, or any leader that I've learned from, in any way; that's just the nature of our business, and it moves so quickly, especially over the last two to three years. So I apologize, this is a tangent, but I wanted to highlight one of the things that I think has really accelerated tool development on the fringes in the data landscape. We've all seen the huge poster with the sprawling options for every which way you can come at data. But I think data warehouses in general were a big blocker for a number of years. Initially, Redshift was the big leap, right, and then BigQuery right on the heels of that. Then I think you hit a wall with Redshift, and it stagnated for a few years until Snowflake came along. Now, we are a Snowflake shop, so I can praise it directly, and we've been very happy with it as a third-party solution. We also touched on Lambda architectures and some of the difficulties of those, and I think a lot of the talk of Kappa versus Lambda has been put on the back burner, because it's kind of been abstracted away with advances in piping data into your data warehouse. We're a heavy Snowpipe user. And if you had come to me a couple of years ago and said, well, can we have both? I would have said not necessarily, but now we kind of handwave the problem away, because we essentially can. It's similar to using Firehose in the AWS landscape. We can pipe our data from the front end into our data warehouse in under a minute now. So why keep a Lambda architecture around? But also, I don't feel like we need to obsess about a Kappa architecture either.

Kostas Pardalis 24:29
You said something that I found very, very interesting. You talked about APIs and contracts. And I want to ask you, what's the equivalent of API contracts in the data world? What can we as data engineers use to communicate, let's say, the same thing that we do with a contract between APIs? If there is something, I don't know, maybe there isn't, and if there isn't, then why don't we have something like that?

Sean Halliburton 24:59
So, like the optimization analogy, I think it depends on the maturity of your data engineering team. And it's probably more typical for a data engineering team in its younger years to handle all of the instrumentation responsibilities. At some point, product owners and executives are going to want some options for self-service. And when that happens, you have a couple of different approaches. One is a service-oriented architecture, which was my initial approach: we provided an endpoint and a contract for logging data, just like so many other logging solutions and other APIs. And that worked well, I would say, for not quite a year before we started hitting walls on it. I think longer term, the better solution, which we have now at CNN and I think is a major asset, is we offer SDKs that sit in front of our data pipeline. Our primary client is web, but we're increasingly expanding into the mobile SDK space. That alone is a challenge, because the more languages you want to offer an SDK in, the more you need developers that are proficient in those languages, of course. But for where we're at right now, between CNN web and mobile, and increasingly CNN+, our JavaScript and Swift SDKs meet our needs. And I think that is a good compromise. It's a more flexible one, especially if you're able to serve your SDK via CDN; then you can publish updates and fixes and patches and new features whenever you need in a much healthier manner, and force fewer upgrades on those downstream teams and, by extension, their end users.

Kostas Pardalis 27:14
How restrictive are these SDKs for the developer? Do they enforce, for example, specific types? Do you reach the point where there are specific schemas that they have to follow? Or can they do whatever they want and send whatever data they want in the end? Because if that's the case, then again the contract is going to be broken. So what kind of model have you figured out there?

Sean Halliburton 27:39
Yeah, now we're into kind of the fine-tuning knobs of self-service, right, and verging into data governance now. So we've provided an SDK, we've provided these stock functions that construct the event payloads. But yeah, there's always some edge case, there's always some custom request where we want to be able to pass data this way, under this name, that the SDK does not allow for. Or maybe there's some quirk of your legacy CMS where it already outputs data in some way, if only we could shim it into that payload. So yeah, there's absolutely a line we walk, a balance we try to strike: self-service where we can, offering this one custom carve-out space where you can pass a JSON blob, ideally with some limits. It's probably an arbitrary, honor-system arrangement, but we'll take your data into the data warehouse, though it'll still be in JSON. Or, okay, we can offer custom enrichment of that data once it's in the data warehouse and model it for you for a set period of time. And then past that point, either the instrumentation has to change, or we just have to figure something else out that works for both sides. Yeah, that's a great question. And it's always a challenge: where does the labor fall? Whose responsibility is that, whose ownership is that? You know, governance is a challenge in so many aspects of life these days, and data engineering and end users and analytics are no exception to that.
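
[Editor's note: a minimal sketch of the kind of contract Sean describes, assuming a browser SDK; the field names, trackEvent signature, and size limit are illustrative, not CNN's actual SDK. Stock functions build the payload, and custom data is confined to a single size-limited carve-out field.]

```typescript
// Editor's sketch, assuming a browser environment; all names and limits are hypothetical.
interface TrackedEvent {
  name: string;
  timestamp: string;                       // ISO 8601
  page: { url: string; title: string };
  custom?: Record<string, unknown>;        // the one "carve-out" for arbitrary JSON
}

const MAX_CUSTOM_BYTES = 4_096;            // hypothetical honor-system limit

export function trackEvent(
  name: string,
  custom?: Record<string, unknown>
): TrackedEvent {
  if (custom) {
    const size = new TextEncoder().encode(JSON.stringify(custom)).length;
    if (size > MAX_CUSTOM_BYTES) {
      throw new Error(`custom payload is ${size} bytes, over the ${MAX_CUSTOM_BYTES}-byte limit`);
    }
  }
  // Everything outside "custom" is constructed by the SDK, not the caller.
  return {
    name,
    timestamp: new Date().toISOString(),
    page: { url: window.location.href, title: document.title },
    custom,
  };
}
```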

Kostas Pardalis 29:27
100%, I totally agree with you. And I think it's one of the problems that, as this space, as the industry, is maturing, we will see more and more solutions around, and probably also some kind of standardization, like we've seen with things like dbt. But from what I understand, and that's also my experience, issues are inevitable, right? Something will break at some point. And the problem with data in general is that it can break in a way where it's not obvious that something is going wrong. You can have, let's say, duplicates, or you might have data reliability issues, right? Your pipeline is still there, it still runs, and outside of seeing something out of the ordinary when it comes to the volume of the data or something like that, you can't really know if you're still sending the right data. So how did you deal with that? What kind of mechanisms have you figured out? Because you have a very long career in the space, what did you do? How do you handle it?

Sean Halliburton 30:36
Yeah, I'm smiling because this always reminds me of the line from Shakespeare in Love where the producer is seeing this madness going on in the theater and asking how they're ever going to pull this off, and I believe it's Geoffrey Rush who says nobody knows, but it always works out, we'll figure it out. And that was definitely the case in data engineering. Until very recently, we were flying blind. There was little to no observability; we were almost entirely dependent on CloudWatch and Datadog dashboards. But even those were not very descriptive. They would tell you how many EC2 instances you had running and whether they were maxing out on CPU, and would take you deep down the rabbit hole of JVM optimization. But really describing the data behind your data was surprisingly hard. When it came to describing how ETL was performing, that was really hard. For me, both as an engineer and as a manager responsible for representing my program, my team, the culture, and engineers that I cared very deeply about continuing to grow, and just in terms of maintaining their quality of life, there were some downright stressful times. There was definitely burnout on the team. So again, people drive software, software drives people. I knew we could do better, and at the time I very much wanted tools to do that. And I'm happy to say, just in the last six months, as an IC again back at CNN, I've been focusing on data quality and observability quite a bit. I've been testing different solutions. Recently, I've been working with re_data and Monte Carlo as observability solutions. Again, I think having a more dynamic data warehouse like Snowflake helps unlock a lot. And I've been working on, simply put, data quality algorithms that can not only tell us how we're doing, but better define, illustrate, and advertise our SLAs and SLOs to our partners and tell them how we're doing with some real numbers.

Kostas Pardalis 33:04
Oh, that’s super interesting. Can you share a little bit more around that?

Sean Halliburton 33:08
Sure. So I believe Eric mentioned dbt back a while ago. I'm a very proud and happy dbt user; we've worked with it extensively to harden our data stack. And I'm using it to capture things like the presence or absence of critical fields in our enriched tables, capturing latency of records as measured from when they land in our raw data tables versus when they reach enrichment and our data marts, and beginning, as I said, to develop an algorithm starting from a certain baseline. You know, say a record is missing a critical user ID: I might subtract a tenth of a point or two-tenths of a point, depending on how critical that ID is. Maybe it's a second-tier ID and it's not as important and the record is otherwise usable. Maybe it's not usable. I still want to send the record on downstream, but with that metric attached to it. And then you calculate that on a record and then a table basis, and you can begin to calculate a daily average, a monthly average, and start to build a scorecard. One of the biggest assets that I think our analytics org at Nordstrom had was a fitness function. That term, maybe it's a little Amazon- or Microsoft-centric, and maybe it's fallen out of favor a little bit, but it's sort of an assessment of your program's technical capabilities and the impact you have on the business. And when you work in analytics, that can actually be harder to do. But we were able to extrapolate a lot of performance metrics out of the test campaigns that we would run, out of the clickstream, and out of features that we would ship. And I think that's actually more critical to you as the IC or the manager than it even is to your executives, because it gives you one more measurable with which to assess your performance against your OKRs.
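
[Editor's note: a rough sketch of the scoring idea described above, with hypothetical weights and an assumed one-hour enrichment SLO; the real field names, penalties, and thresholds are not given in the episode. Each record starts at 1.0, loses fractions of a point for missing or late fields, and per-record scores roll up into a daily or monthly scorecard average.]

```typescript
// Editor's sketch; field names, penalties, and the latency threshold are assumptions.
interface EnrichedRecord {
  userId?: string;          // first-tier ID
  secondaryId?: string;     // second-tier ID
  landedAt: Date;           // when the record hit the raw table
  enrichedAt: Date;         // when it reached the data mart
}

function scoreRecord(r: EnrichedRecord): number {
  let score = 1.0;
  if (!r.userId) score -= 0.2;        // missing critical ID: heavier penalty
  if (!r.secondaryId) score -= 0.1;   // missing second-tier ID: lighter penalty
  const latencyMinutes = (r.enrichedAt.getTime() - r.landedAt.getTime()) / 60_000;
  if (latencyMinutes > 60) score -= 0.1;  // assumed enrichment SLO of one hour
  return Math.max(score, 0);              // the record still flows downstream, score attached
}

// Roll per-record scores up to a daily (or monthly) scorecard average.
function averageScore(records: EnrichedRecord[]): number {
  if (records.length === 0) return 1.0;
  return records.reduce((sum, r) => sum + scoreRecord(r), 0) / records.length;
}
```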

Kostas Pardalis 35:28
That's super interesting. Okay, let's say we have in place amazing algorithms to monitor and figure out if something goes wrong. And let's say something goes wrong. How do we debug data? I mean, as software engineers, we know that we have tools to debug our software, right? We have debuggers, we have different tools that we can use for that, we have testing, we have many different things that are both tools and also engineering processes and best practices that we have learned that help reduce the risk of something breaking. How do you debug something that starts from the client of a mobile app and at some point reaches your data warehouse, where anything can go wrong in the line between these two points, right? So how do you do that?

Sean Halliburton 36:24
Yeah, that's another balance you have to strike: how much work do you want to put into making your synthetic tests appear as organic as possible, right? We've tested our pipeline using tools like serverless-artillery to the point that we can accept hundreds of thousands of requests per second. I mean, think about the news industry in general: yeah, there are planned events, like elections, but there are also unplanned events throughout the year that can drive everyone to their phones and their laptops, and those can be extremely unexpected, and we need to prepare for that. So we've used the output of those tests to beef up things like cross-region failover pipelines, things like that. But even then, you're operating on an assumption where you're probably using a fixed set of dummy events. So then you have to decide, okay, is it worth dedicating time, maybe pulling in some developers from the mobile team, to more accurately simulate how a user uses the app as they understand it? Keep in mind their assumptions may be based on your own assumptions that are coming through your pipeline, based on the data that you're serving to their analysts. But yeah, you could certainly go down a rabbit hole and put a lot of work into automating tests from different platforms and devices; then it's just a matter of how far down that hole you want to go and how much you want to invest.

Kostas Pardalis 38:03
Makes total sense. And my last question, and then I'll give the microphone back to Eric: you mentioned at some point the more senior people that are involved in this, like VPs, the leadership of the company, probably people who maybe don't even have a technical background, who cannot understand what delivery semantics are, or the limitations of the technology that we have, and all these things. At the same time, you said it's very important to make sure that you can communicate these things to them. Do you have some advice to give around that, like how you can communicate effectively to them the limits of the data, how much they can trust it, and the limits of the technology and the people that we have?

Sean Halliburton 38:52
That is a primary function of the engineering manager, naturally, but no engineering manager can do that alone. So my advice to those considering management is, before you accept such an opportunity, and it may be a fantastic opportunity, do what you can to ensure that you have backup in place: insist on program management help, because you as an EM are busy managing not only the careers but the lives and even the mental health of the technical talent that you worked hard to get in the door. You can't be in every meeting. You can't be in every scenario, you can't cover every hour; you'll burn yourself out if you try. Likewise, I would also recommend insisting on a product manager, because there are a hundred different technical ways you could take a product, and 99 of those might not meet the actual demand of your internal users downstream. And in good engineering, we talk mostly about internal users as our customers, but I do believe that extends to the external end user as well. Similarly, there's only so much you can do as an engineering manager to evangelize for the system you're building and to canvass your users on how it can be better and how you could better serve them. So those are really what I call the minimum three legs of the stool that you need to build an effective data engineering team, and to meet those requests that are only going to mature as your consumer teams mature and as the business matures around you.

Kostas Pardalis 40:48
Thank you. I think we got some amazing ideas and advice from you. Eric, over to you and your questions.

Eric Dodds 40:57
I have so many, actually, that we don't have time to get to, because Brooks is telling us we're close to the buzzer. The SDK conversation is absolutely fascinating, so we'd love to have you back on and talk more about that. I have two more questions for you, both kind of quick, I think. Maybe not; maybe I should stop saying that, because usually they're not. You have a really wide purview as sort of a buyer, user, and researcher of tools. You have a bias towards open-source tooling, but you also use third-party SaaS that isn't open source. What are some of the tools, whether you use them or not, that you've seen in the data and data engineering space that are really exciting to you, that you think represent, okay, this is kind of the next phase of tooling that's going to be added to the stack?

Sean Halliburton 41:48
Yeah, first and foremost, I would say the data observability and data quality tools, just because, again, they have such a direct impact on quality of life for your engineers, and by extension for their leaders. And it's such an obvious need that just went unmet for so long, and I'm not even exactly sure why that is. But I'm very happy to see that we've brought principles of site reliability engineering into the data engineering space. You know, it's like everything went into the cloud, and all of your sysadmins became SREs, and now a lot of those SREs are starting to look toward the data space, or the data space is starting to look to those SREs to say, hey, can you help us out and make sure that this thing stays up? Because if the data is gone, it's gone. There's no way to get it back. Yep. So that's one thing I'm excited about. I'm also, I would say, carefully excited about machine learning. I think ML, like blockchain, is one of those things where it's easy to say, oh, we should be doing this, without really stopping to ask, do we really need it? So much like some of the data that you might think you want to capture, I would again suggest: think carefully before you apply machine learning to a problem that might not yet be there. But that said, I'm excited to see where it continues to go in the next few years. I hope it comes with a dose of ethics, because I think that is critical to the process. And maybe, you know, once data observability improves, AI ethics improvements are right on the heels of that. But executing ML models is still fairly complex. I think that will improve over the next couple of years and become more closely integrated with the data stack. And in terms of applications of all of this, of ML and personalization, I think I'm most excited about the health space, which I have not worked in personally. But I think it has the biggest impact, simply because you have the greatest diversity of end users there, and it's one of the most complex problem spaces, obviously. And, you know, with our health system as it is, it's tempting to try to make an end-around to try to deliver some of those solutions that kind of break down the silos. So I hope we can continue to do that in a responsible way as well.

Eric Dodds 44:40
Very cool. I am 100% aligned with you on all of that, and actually, I've been writing a post that hits on some of those direct points. Okay, last question before we're at the finish line. You've been an individual contributor and an engineering manager and have worked in a variety of roles. Maybe give your top two pieces of advice: maybe a piece of advice for an individual contributor who aspires to be a manager, and then a piece of advice for a manager who's early in their career working with a data engineering team.

Sean Halliburton 45:21
For the IC that is considering management, I would say it is absolutely a very rewarding career change. But it is a big change, and the role is not as simple as just speaking as the most technically mature member. You know, frequently the most senior IC becomes the manager out of necessity, and it looks on paper like it's a very natural transition, but that's not necessarily the case. It's a very different skill set. You do need to be able to speak to what's being built and delivered, but you are coming at it from a very different direction. And the way I like to think of it is, your role is as much a product manager of technical careers as it is a manager of a complex technical system that's hosted somewhere and rendered somewhere else. Come with humility, be prepared to listen, ask more questions than you make pronouncements. And I think that's a good transition to the person that is already a new manager: again, expect to do much more listening. Ask tough questions when the time warrants it, but you're there to learn from the people that you've retained or recruited to work for you. Leverage them to be your experts. Don't try to be the smartest person in the room; you are there to hire the smartest people in the room, and to be able to send them into rooms that you can't reach because you're overbooked, and you will be overbooked.

Eric Dodds 47:17
My friend, that is some of the best practical advice for management I’ve ever heard and is so true. Well, Sean, this has been such a fun episode. I’ve learned a ton and we just thank you so much for your time and sharing all your thoughts and wisdom.

Sean Halliburton 47:34
Absolutely. I’m happy to help anytime.

Eric Dodds 47:37
I feel like we could have talked for hours, and that's such a cliché saying now because we say that every time, but really, it's true. I'm going to pick something really specific as my takeaway. Whenever you hear about a data engineering team building their own SDKs, to me, that's an eye-opener, because I don't come from a software engineering background, but I know enough to know that's a pretty heavy-duty project to take on at the scale that they're running at, you know, a company like CNN with the traffic volumes that they have. I mean, building a robust SDK is no joke. But the more I thought about that after Sean said it, I kind of reviewed my mental Rolodex of hearing that, and I realized it's really not the first time that I've heard of a large enterprise organization building their own SDK infrastructure, in large part because the needs that they have to serve for downstream consumers, to Sean's point, are so complex. And so even if you take something off the shelf and modify it, you end up with something that's pretty different than the original SDK anyway. So that's just fascinating to me. And it's pretty fascinating also, I think, to just consider a situation where building your own SDK is the right solution.

Kostas Pardalis 49:07
I totally agree with you. I would say that I keep two things from the conversation we had with him. One is the concept of a contract that comes from building APIs. I think it's a very interesting way of thinking about building data contracts too: what that contract would look like, how it can be implemented, and what we can learn from building these services all these years and apply that knowledge in the data space. That's one thing. The other thing is, I think, at the end he gave some amazing advice on how to be a manager, which I think was super, super valuable for anyone who is interested in becoming a manager, but also in interacting with a manager, which is pretty much everyone, right? So that was also amazing.

Eric Dodds 49:57
Yeah, I agree. And I think, you know, he said that in the context of data teams specifically, but it's really just great advice in general, so I really appreciated that. All right. Well, thank you for joining us on The Data Stack Show. Tune in for the next one. Lots of great shows coming up.

We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.