Episode 21:

Data Integrity and Governance with Patrick Thompson and Ondrej Hrebicek from Iteratively

January 20, 2021

On this week’s episode of The Data Stack Show, Kostas and Eric are joined by the co-founders of Iteratively, CEO Patrick Thompson and CTO Ondrej Hrebicek. Iteratively helps companies know that their data can be trusted by helping capture clean, consistent product analytics. Today’s conversation digs into the behind-the-scenes of Iteratively and how trust in data can help accelerate the velocity of an organization.

Notes:

Highlights from this week’s episode include:

Patrick and Ondrej’s background and the biggest problem Iteratively addresses (2:50)
Why some companies still use spreadsheet schema management and the potential pitfalls they’re setting themselves up for with this (4:39)
Defining schema in the context of data (7:02)
Viewing the process as a team sport (11:34)
Identifying common mistakes and implementing best practices (13:46)
A walkthrough of Iteratively (17:13)
Utilizing a JSON schema format (26:58)
Laying Iteratively on top of or integrating it with an implementation for analytics (30:36)
Entry point into organizations (33:02)
Organizational change and velocity realized after implementing Iteratively (36:04)
What’s next for Iteratively? (42:47)

The Data Stack Show is a weekly podcast powered by RudderStack. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 00:06

Welcome back to The Data Stack Show. I have actually been spending time creating content around our data governance API at RudderStack, and that makes me really excited to talk to our guests today. We have Patrick and Ondrej from Iteratively, and they provide really cool tooling around data governance. I think a couple of the things that I’m interested in are, first, how the tool interacts with current analytics setups. You know, when you really talk about data governance and sort of adjusting the way that you’re doing things, you run into instrumentation problems. So from a practical standpoint, I’m just interested to know how they handle that. And then I’m also interested to know, they have a couple different people that they probably serve within an organization, right? Data governance, as we’ve heard before in our conversation with Stephen from Immuta, really crosses the organization across many different roles and teams. So those are my two questions that I want to make sure I ask. But Kostas, what are you thinking about in terms of data governance and the Iteratively team?

Kostas Pardalis 01:13

I’m very excited to have them today on our show, mainly because, as we have seen in the past, data governance is a very big thing. There are so many different things that need to happen in order to implement data governance. We talked about access control with Immuta, and now we are going to talk more about data quality. And that’s something very interesting for me from a product perspective, but also from a technology perspective, because a very fundamental part of data quality is how we can describe our data, how we can attach syntactic and semantic meaning to this data, and how we can track changes to that. And also, one more, how we can connect this syntactic and semantic meaning with business goals. Because we don’t do technology for technology’s sake; we do technology because we’re trying to achieve something, right? So I think it’s gonna be very interesting, both from a product perspective and also from a technology perspective, to see what kind of technologies they use, how they present this information, how they track this information, and also how, by implementing all that stuff, the organization is getting value at the end. So let’s dive in. And let’s start with them.

Eric Dodds 02:25

Sounds great. Let’s do it. Patrick, and Ondrej from Iteratively. Welcome to the show, gentlemen.

Patrick Thompson 02:33

Eric, thanks for having us. Definitely excited to be here today.

Ondrej Hrebicek 02:36

Good to be here, Eric.

Eric Dodds 02:36

Great, well, why don’t we do just a quick intro? We like to start by having each of you give a brief background, and then just tell us what Iteratively is and the problem that you’re solving.

Patrick Thompson 02:50

Perfect. Yeah, I’m happy to start on our side. My name is Patrick. I’m one of the co-founders and CEO of Iteratively. I’ve been working on Iteratively with Ondrej for the last two years, and previous to that I was on the growth team at Atlassian for four years. And then before that I had the opportunity of working with Ondrej at his startup. At Iteratively, we’re solving, you know, one of the biggest problems that we heard from six months of customer discovery with different software teams, which was companies not trusting the data that they’re capturing, primarily because of human error. We solve that by really, you know, trying to centralize the tracking plan within these organizations and making it really easy for teams to collaborate and schematize the data that they’re capturing.

Ondrej Hrebicek 03:27

And yeah, I’m Ondrej, nice to meet everybody. I’m the CTO and co-founder at Iteratively, and I run the product team here. And like Patrick mentioned, before that I was a co-founder of a company called Syncplicity, also in the data space. And after that, Microsoft.

Eric Dodds 03:41

So we have lots of questions about the product. And actually, Patrick, you mentioned the word trust, which we had talked about with a previous guest around data and data governance. So I definitely want to dig into that and I know Kostas has a bunch of questions. But I have just a question coming from my background, because I’ve done the sort of spreadsheet schema management thing for many years, and when we started talking with you about being on the show, I just couldn’t help but think that it’s crazy that it’s now 2021, I guess, and still, so many teams are doing the, you know, shared Google Sheets schema management thing, which just seems so wild to me, because I mean, it’s primitive, really, with how advanced we can get with software, but you talk with customers who are using your product every day. Why are we in a place where companies still haven’t moved past the spreadsheet for this?

Patrick Thompson 04:39

Yeah, no, great question. I think really, generally speaking, we were actually surprised by this as well. I mean, we spent a ton of time interviewing these companies and the pain came up time and time again. And yeah, I mean, it was either a Confluence page, a spreadsheet, a Notion page, you name it. But the reality is that teams just didn’t have good tooling available to solve this, and the problem becomes so acute at some point where you grow and this data is, you know, revenue driving at the end of the day. So a lot of teams are solving this in house by building out their own internal tooling. The vast majority of teams out there today are, yes, simply relying on a spreadsheet or nothing at all, which inevitably leads to a lot of human error within the process. I think generally it just comes down to a lack of knowledge or foresight. Once you’ve been bitten by this before and have had to suffer the consequences of bad data, it’s definitely something that you look to solve during implementations and as part of your process moving forward. But yet, most companies don’t intuit that there’s a solution beyond a spreadsheet for documenting and collaborating around their analytics.

Eric Dodds 05:39

Yeah, yeah, I mean, I’ve seen situations where it gets so bad that you literally just start over, because fixing it is way, way harder than actually just starting with a clean slate and doing it right, you know, from the get-go, which is pretty wild.

Patrick Thompson 05:55

Yeah, 100%, we saw it. We’ve talked to a lot of companies, I think Airtasker comes top of mind for me, where they had to pause their entire development roadmap for six weeks to, unfortunately, kind of throw the baby out with the bathwater, but to re-implement and architect their data model and start from scratch, because none of the data that they’re capturing was reliable. It was a huge, huge shift for them. But definitely something that, you know, when it comes to being valuable for the business, data is kind of the lifeblood of most organizations these days. So something that was definitely worth the investment for them.

Kostas Pardalis 06:26

A quick question, I think, before we move forward and dive deeper into the product: let’s discuss a little bit more about what schema is. I mean, I know that among us it’s a term that is very easy to understand and communicate. But schema is something that can mean many different things in technology, right? And in data in general. So what is schema? What do you consider as the data schema in your case?

Patrick Thompson 06:52

A great question. I’ll actually pass this puck to Ondrej for helping kind of define some of the personas that we typically work with, and then how we think about schema in the context of data.

Ondrej Hrebicek 07:01

Yeah, definitely, Patrick. I mean, it’s interesting, because the word schema really isn’t used very often in this particular space. People call this definition of analytics data a tracking plan, data plan, measurement plan, there’s all sorts of terms, but schema usually isn’t used. And we actually don’t use it that often ourselves either, for that reason. At the end of the day, it is schema, and it’s actually represented as a schema under the covers. But what it really is, is the structure and the definition of the analytics data that you want to capture and send to your analytics destination. It’s the names of the events, the internal structure of those events, such as the attributes or properties that are attached to those events, and the types of those properties. Are those properties numbers, or strings, or true or false values? What are some of the restrictions or rules on the values of those properties? That’s all embodied in the so-called schema, in order to not just define what the structure is, but then potentially enforce it when the data is actually collected, and do some other interesting things that we’ll talk about a little bit later, such as, you know, generate code that matches that schema and helps developers instrument analytics.
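To make the description above concrete, here is a minimal sketch of what one tracking-plan entry could look like when written as a JSON Schema document. The event name, property names, and rules are all hypothetical and chosen only for illustration; this is not Iteratively's actual output format.

```python
import json

# Hypothetical tracking-plan entry for a "Song Played" event, expressed as a
# JSON Schema document. Every name and rule below is illustrative only.
song_played_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "Song Played",  # the event name
    "description": "Fired when a user starts playback of a song.",
    "type": "object",
    "properties": {
        # Property types: strings, numbers, true/false values.
        "song_id": {"type": "string"},
        "duration_seconds": {"type": "number", "minimum": 0},
        "is_premium_user": {"type": "boolean"},
        # A restriction on the allowed values of a property.
        "source": {"type": "string", "enum": ["search", "playlist", "radio"]},
    },
    "required": ["song_id", "source"],
    "additionalProperties": False,
}

# The same document serialized as JSON, the form tools would exchange.
print(json.dumps(song_played_schema, indent=2))
```

The event name, the property list, the property types, and the value restrictions each map onto one part of the document, which is what makes the definition enforceable later on.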

Kostas Pardalis 08:13

That’s very interesting, because I think that for everyone who’s been involved in technology in general, the word schema is usually more associated with something like a database, which, in your use case, is probably, let’s say, the destination where the data arrives, where it is delivered to. But what we’re talking about here is actually how we can have a schema and how we can enforce or monitor the schema at the source where the data is created, where the data is generated. Is that right?

Ondrej Hrebicek 08:40

Yes.

Kostas Pardalis 08:40

And so that’s a bit of, let’s say, a shift in terms of the perception of where the schema is implemented. So what’s the value of doing that at the source? Why is it important, actually, to do that? Why can’t we just do it at the destination, like on the database, and finish the work that we have to do around the semantics of the data and its structure there?

Ondrej Hrebicek 08:59

Right, right. Good, great question. Because, in fact, technically you can, and we as an industry have been doing that for a long time. Companies like Segment have been around for a while, and others, that let teams send really arbitrary information, arbitrary analytics data, to the back end, and then they try to take this data and represent it in a schematized store. And usually, they have to go through a lot of contortions to make it all work and have the schema catch up with the data that’s coming in. And it leads to a lot of problems, a lot of data quality and data management problems, and data analytics problems as well, which is really the big reason why thinking about schema and structure way ahead of the ingestion, when the data is actually being instrumented and captured, is so important. It means that the knowledge around what is going to be stored in a particular database, and what the schema of the database is going to look like, is known ahead of time, and the whole team can be on the same page about what’s being captured, what it is going to look like when it’s persisted in a data warehouse, and then how we can analyze the data, given a structure that we’re all behind and all aware of.

Patrick Thompson 10:13

Yeah, I think the one thing that I’d add also is, when we think about how teams operate and analyze data, putting in the effort ahead of time to define your tracking plan, to define the analytics events that you actually do want to capture, is super helpful versus trying to take a reactive approach to cleaning data or capturing data. Be as proactive as you can. Solving this at the source is very much a best practice when it comes to actually getting data that you can use and consume across your entire data stack.

Kostas Pardalis 10:46

It’s super interesting, guys. Actually, we keep mentioning things about the whole lifecycle of the data, right? We started talking about storing the data in the data warehouse, the database, and the point where the data is captured. And I assume that there are many different roles and many different people who are involved in this data, let’s say, lifecycle. So who are the people who are involved? What are the roles? And at the end, who should care about the schema? And after that, we can discuss a little bit also how you see who should govern the schema. Who should define it? Who is implementing it? I get the feeling that there are many people involved in this, and I think this is quite interesting to learn a little bit more about.

Patrick Thompson 11:34

Yeah, Kostas, great question. We definitely view this whole process as very much a team sport, generally speaking, when it comes to the definition of analytics, the instrumentation of analytics, and the validation or verification of the data downstream as well. So that involves typically folks like data analysts, data scientists, and product managers defining what the success criteria are for the features that they’re working on or the experiments that are shipping, and what needs to be captured in order to analyze that effectively. It goes all the way from the engineers actually having to write instrumentation code to capture that data as part of their work, to, you know, downstream data engineers who may have to update or maintain the data pipelines or the data warehouse, to actually deprecating the collection of that as well, right? So generally speaking, if the data is not being used, how do you remove that data? Is that data still something that’s worth capturing? What is our risk profile for actually maintaining and storing that data long term? That involves everybody from security, legal and compliance organizations to typically the governors within bigger organizations that we might be working with. So yeah, it’s very much dispersed across the organization. Typically, the folks that we work with quite often are the product organizations as well. So yeah, typically, your PMs, your data analysts and your engineers, really trying to, you know, integrate analytics into their software development lifecycle.

Eric Dodds 12:55

Patrick, one question before we … because I wanted to dig into some of the technical details as well … but I’m just thinking about the situation that we mentioned earlier, around things getting pretty bad, you know, just from a data quality standpoint. What are some of the common things you see? And I’m thinking about our listeners who may be in a role where they’re working on this, right, or this is part of their job. And we hear this all the time with, you know, data engineers or developers, especially at earlier stage companies. What are some of the things that you see as the best practices, maybe just a couple practical things for people saying, man, I don’t want to get into that place and I feel like I have the opportunity right now to do something about it? I mean, other than signing up for Iteratively to make the process easier, what are some of the big or most common mistakes that you see?

Patrick Thompson 13:46

Yeah, not having an owner is probably the biggest one. So like having, you know, a central gatekeeper or somebody to own your tracking plan. Regardless of whether that’s in Iteratively, or if you are using a spreadsheet or some other type of solution, having somebody that really maintains the quality and the understanding of this, and, you know, shepherds that documentation throughout the organization, is definitely something that’s very important. And then creating consistent standards around a taxonomy. So what are your naming conventions like? How are you representing this data? Being able to pull in folks like the product manager and the analyst and having a conversation around what we should be tracking, those meetings are super critical to the success of getting good, clean, quality data into tools like RudderStack or you name it. And generally speaking, and this goes into a little bit more about how the product works for Iteratively, we tend to view having a single source of truth that is codified as definitely the best practice, and being able to generate strongly typed SDKs that match those conventions is very, very important. That’s something that we’ve seen other companies, like obviously Atlassian and the Airbnbs and the Ubers of the world, adopt, and it really helped improve overall data quality. Other than that, you know, the data model for each organization is really specific. So thinking through it, making sure that it is something that can answer the business questions that you have as an organization, is very important. That goes beyond tooling at the end of the day, to building a culture where people can feel empowered to utilize their data and ask insightful questions. So there’s definitely a cultural impact beyond just tooling within most organizations.

Eric Dodds 15:21

Sure, yeah. It’s interesting. I mean, we’ve talked to the people who’ve come from different teams, right, whether it’s, you know, on the analyst side, or the data engineering side, but you do have almost this role that acts as sort of an internal ambassador of sorts across teams, right. And so they have to interact with various stakeholders, but act as the owner, which is really interesting. And that’s, that’s a common theme. We’re hearing more and more. But that really does seem to be a key piece of making it work really well inside of a company.

Patrick Thompson 15:51

Yeah, definitely.

Ondrej Hrebicek 15:52

Yeah. The other thing I would add there, Eric, in terms of the kinds of problems and best practices: there’s definitely the aspect of the schema. Is the structure of the analytics event correct? The other aspect is, is the event firing at the right time and in the right place in the source code? That’s another thing that we see folks run into a lot, where they think they’ve implemented analytics correctly, but it’s actually not working that way in production. And the best practice that we see the best teams follow, and it’s what we recommend right now as well, is to add automation to analytics, just like you add automation for other functionality in your product, and treat analytics as a first-class citizen inside your application. I feel like the industry went through this mindset change maybe a decade ago on the security and performance side. Now we all think of security and performance as a feature, something that we pay attention to anytime we ship. But analytics still seems to be a bit of a redheaded stepchild here, which I think is a huge mistake, and moving it up the priority stack a little bit and encouraging engineering teams to add coverage for analytics into their unit and integration tests is paramount in our opinion.
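One way to picture the "coverage for analytics in unit tests" idea is a test that asserts an event fires exactly once, with the expected payload, when the relevant code path runs. The sketch below assumes a hypothetical `complete_checkout` function and a hand-written `FakeTracker` test double; neither comes from the conversation or from any particular analytics SDK.

```python
import unittest


class FakeTracker:
    """Test double that records events instead of sending them anywhere."""

    def __init__(self):
        self.events = []

    def track(self, name, properties):
        self.events.append((name, properties))


def complete_checkout(tracker, order_id, total):
    """Hypothetical product code: finish a checkout, then fire one event."""
    # ... business logic would run here ...
    tracker.track("Checkout Completed", {"order_id": order_id, "total": total})


class CheckoutAnalyticsTest(unittest.TestCase):
    def test_checkout_fires_exactly_one_event(self):
        tracker = FakeTracker()
        complete_checkout(tracker, order_id="o-123", total=49.5)
        # Checks both that the event fired and that its payload is right.
        self.assertEqual(
            tracker.events,
            [("Checkout Completed", {"order_id": "o-123", "total": 49.5})],
        )


# Run the test case programmatically and keep the result for inspection.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(CheckoutAnalyticsTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Injecting the tracker as a parameter is a design choice made here so the test can swap in a fake; a team could equally patch a global client.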

Kostas Pardalis 16:55

That’s great, guys. So give us like a quick walkthrough of the product. Let’s say I’m a new user just signing up on the product, what should I expect? And what should I do in order to start realizing the value that I can get from the product?

Patrick Thompson 17:13

Yeah, Kostas, great question. So I mean, the whole lifecycle for somebody who’s adopting Iteratively is, they’d be working on importing their schema into the tool. And that could just be from a CSV, or a Mixpanel or Amplitude export, or some other type of export. They import that data into the tool to really kind of create that single source of truth. They’d invite their team into the tool as well. One of the things that Iteratively really is, is a documentation tool for the entire team. So we want as many folks as possible to have access to it, to be able to understand what is being tracked and why it’s being tracked within their organization. They’d invite their developers into the product as well. So for a new feature or a new experiment that you’re working on, you’d actually go define your events. And you can think of Iteratively as kind of like GitHub for analytics: you have all the same features and functionality that you’d be used to for collaborating on code. So you can create a new branch for a new feature or experiment that you’re running, add your new events, add your new properties on those events, and assign it to a developer to work on. They’d actually pull down a strongly typed SDK, so all of those new events that you’ve defined, all those new properties, would get included in a bundle that we generate for developers. They’d instrument it and actually verify that the instrumentation is correct. You get a lot of the benefits because of the type safety which is built into the SDK. But we also validate all of the runtime payloads against the schema or the validation that you’ve actually defined inside of Iteratively. They’d update the status of the branch and merge that back in, similar to how you’d be merging a feature branch inside of Git into your mainline branch in code.
And all this kind of happens really seamlessly by keeping everyone up to date on what’s happening, integrating with tools like Slack and Jira. Making it really easy for everyone to have insight into how their analytics are evolving is super important for us. And then we also sync that schema into other third-party tools. If you’re using an analytics tool like Amplitude or Mixpanel, we actually federate the schema there as well. So you have all the descriptions that you’ve added for your events show up in all these third-party tools, which makes it really easy for data consumers to analyze as you’re publishing new changes to your tracking plan. Anything I’m missing there, Ondrej?
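The "strongly typed SDK" idea in the walkthrough can be illustrated with a small sketch of what generated code might look like: one typed class per event and one method per event on a client. The class names, method names, and `send` callback below are assumptions made for this example and do not reflect the code Iteratively actually generates.

```python
from dataclasses import asdict, dataclass
from typing import Any, Callable, ClassVar, Dict


@dataclass(frozen=True)
class SongPlayed:
    """Hypothetical generated class for a 'Song Played' event."""

    EVENT_NAME: ClassVar[str] = "Song Played"  # not a payload field
    song_id: str
    source: str
    duration_seconds: float = 0.0


class Analytics:
    """Hypothetical generated client exposing one method per event."""

    def __init__(self, send: Callable[[str, Dict[str, Any]], None]):
        # `send` stands in for whatever delivers events to a destination.
        self._send = send

    def song_played(self, event: SongPlayed) -> None:
        self._send(SongPlayed.EVENT_NAME, asdict(event))


# Usage: collect events in a list instead of sending them over the network.
sent = []
analytics = Analytics(send=lambda name, props: sent.append((name, props)))
analytics.song_played(SongPlayed(song_id="s-42", source="playlist"))
print(sent)
```

With this shape, a misspelled property or a missing required one is caught by the editor or a type checker such as mypy before the code ships, which is the benefit attributed to type safety above. Runtime payload validation is a separate, complementary check.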

Ondrej Hrebicek 19:20

You covered the main parts of it definitely.

Kostas Pardalis 19:23

Usually, based on your experience so far, how long does it take for someone to deploy the product?

Patrick Thompson 19:28

From our perspective, it really just depends on the size of the organization and how much analytics they have in place today. Companies starting from scratch can get successful in less than a day; it’s really easy to get started and get going. If somebody has analytics in place, typically our recommendation is for them to adopt a tool like Iteratively progressively. Use it for a new feature or a new product that you might be deploying. Make sure that it really meets the needs of your team and your organization, put it through its paces, and then sort of treat your existing analytics as technical debt and migrate it over time, and add it to your test coverage as well. Integrate Iteratively into CI/CD to really validate and give you ongoing assurance that your analytics are correct. It’s anywhere from less than a day to typically around two to three weeks for most of our companies.

Kostas Pardalis 20:13

That’s quite fast, to be honest. I would have expected that it might take longer to do it, especially because you have to include many different roles there, and you need to set up some things yourself before you, like, experience the value of the product. But that’s a great indication that you are on a good track and you’re building a great experience. So well done.

Patrick Thompson 20:35

I was gonna say the aha moment for most of these companies is when all of their events show up as green inside the tracking plan. So that’s really something that we’ve been focusing a lot of time and energy on. Depending on the size of the organization, you know, most of the folks who typically need … most of the stakeholders we’re working with are, as I mentioned earlier, kind of the data analysts and the PMs, typically, as long as everybody’s bought into analytics being important to their organization, it’s a relatively painless process.

Kostas Pardalis 21:01

Great. So, guys, let’s focus a little bit more on the schema, because I have a feeling that the schema is a very important part of the product. I have two questions, actually. One’s more technical, and the other is more on the business side of things. So technically speaking, what is the schema for you? And how is it defined? What kind of serialization do you use or support and develop? What are the customers out there usually using? Eric mentioned at some point that many people are just using Excel sheets or Google Sheets for that purpose. And yeah, how do you consume that? How do you version it? All these things around the schema itself. And also, if you can give us a few more statistics from your experience, how often does the schema change within an organization?

Ondrej Hrebicek 21:52

The first thing I’d say, Kostas: as far as the user of the platform is concerned, they really don’t interact with the underlying implementation or the details of the schema itself. We tried to hide as much of that as possible, because it’s not really relevant to the day-to-day operations. Behind the scenes, everything is driven by JSON Schema. That’s the format that we decided to double down on for our integrations. It’s a de facto standard in the world of analytics data; pretty much all analytics data is represented as JSON today. So it was a natural choice for us. And we use JSON Schema not only to push the definitions, the tracking plan definitions, into other tools like Mixpanel and Amplitude and Snowplow, but also to drive the validation of the data on the client side. So the schemas and the rules and the definitions that are defined in the Iteratively tracking plan are represented as a JSON Schema document, which is bundled into the SDK that we code-generate. And we use just the standard, best-of-breed JSON Schema validation libraries to validate the payloads against those schemas. And if we detect anything is off, or if those libraries detect anything is off, we let the developer know right then and there that there is a problem. The layer that sits on top of this is, like Patrick mentioned, very Git-like, so there is support for versioning of schemas, and there’s support for branching of schemas as well. And it works very similarly to how source code gets versioned and branched. There is a way for folks to propose changes to the schema in a staging version, they can comment on those, they can collaborate around those, and when they’re ready, they publish a new version, which generates a new version of the tracking plan. And ultimately, when they’re ready to merge those changes into the mainline tracking plan branch, they go ahead and do that just like they would in, let’s say, GitHub or Bitbucket.
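The client-side check described above can be sketched as follows. Ondrej mentions using standard JSON Schema validation libraries; to keep this example dependency-free, the `validate` function below is a hand-rolled stand-in that understands only a small subset of JSON Schema (`type`, `enum`, `required`, `additionalProperties`). The schema and payloads are hypothetical.

```python
from typing import Any, Dict, List

# Type checks for the handful of JSON Schema types this sketch supports.
_TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "boolean": lambda v: isinstance(v, bool),
    # bool is a subclass of int in Python, so exclude it explicitly.
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
}


def validate(payload: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the payload passes.

    Simplified stand-in for a real JSON Schema validator (assumption).
    """
    errors = []
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in payload:
            errors.append(f"missing required property '{name}'")
    for name, value in payload.items():
        rule = props.get(name)
        if rule is None:
            if schema.get("additionalProperties", True) is False:
                errors.append(f"unexpected property '{name}'")
            continue
        check = _TYPE_CHECKS.get(rule.get("type"))
        if check and not check(value):
            errors.append(f"property '{name}' should be of type {rule['type']}")
        if "enum" in rule and value not in rule["enum"]:
            errors.append(f"property '{name}' must be one of {rule['enum']}")
    return errors


# Hypothetical event schema bundled with the SDK.
schema = {
    "properties": {
        "song_id": {"type": "string"},
        "source": {"type": "string", "enum": ["search", "playlist", "radio"]},
    },
    "required": ["song_id", "source"],
    "additionalProperties": False,
}

good = {"song_id": "s-42", "source": "search"}
bad = {"song_id": 42, "source": "email", "extra": True}

print(validate(good, schema))
for problem in validate(bad, schema):
    print("analytics validation:", problem)
```

Returning a list of problems rather than raising on the first one lets a developer see every issue with a payload at once; whether to raise, log, or drop the event in production is a separate policy decision.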

Ondrej Hrebicek 23:50

The other thing related to versioning that’s probably worth mentioning is, it’s not really just the tracking plan that’s versioned, it’s the events as well. So every event that gets changed and published gets a new version. We were inspired by the work that the Snowplow team did with Iglu, and specifically the SchemaVer spec, which defines how you apply semantic versioning to schemas that represent data. And we’ve adopted that approach for our event versions as well, which lets us and our customers tell whether a particular change to an event schema is minor, meaning it’s backwards compatible and forwards compatible, or whether it’s a major change that will usually require changes on the back end where the data is stored in order to persist the new version of the schema correctly. So that’s the story there on versioning and branching. As to your last question, Kostas, as to how frequently these change, you know, every customer is a little bit different. There’s definitely a lot of work that happens upfront to make sure that the tracking plan is correct. Initially, there’s a lot of common events that everybody wants to capture related to user identity, sign up, sign in, log out, you know, pageviews, things like that. So there’s definitely a lot of activity upfront. And then it depends on the maturity of the company. A company that uses analytics for growth experiments, and a company that cares about measuring the success and the outcome of a particular release or a particular feature, will come to the tool on a weekly basis to add new branches and new events to their tracking plan for whatever new features they’re working on. Other companies that may not be quite as data mature will come in a little bit less often and only create events when the marketing team or the customer success team has a new analytics requirement.
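The minor-versus-major distinction above can be sketched as a comparison between two versions of an event schema. The rules in `classify_change` are a deliberately simplified assumption (removing a property, changing its type, or newly requiring one counts as major; adding a property counts as minor), and the two-part `major.minor` version string is hypothetical. Neither is taken from the SchemaVer specification or from Iteratively's implementation.

```python
from typing import Any, Dict


def classify_change(old: Dict[str, Any], new: Dict[str, Any]) -> str:
    """Classify a schema change as 'major', 'minor', or 'none' (simplified)."""
    old_props = old.get("properties", {})
    new_props = new.get("properties", {})
    removed = set(old_props) - set(new_props)
    added = set(new_props) - set(old_props)
    retyped = {
        name
        for name in set(old_props) & set(new_props)
        if old_props[name].get("type") != new_props[name].get("type")
    }
    newly_required = set(new.get("required", [])) - set(old.get("required", []))
    # Anything that can break existing storage or consumers is major.
    if removed or retyped or newly_required:
        return "major"
    return "minor" if added else "none"


def bump(version: str, change: str) -> str:
    """Bump a hypothetical 'major.minor' version string."""
    major, minor = (int(part) for part in version.split("."))
    if change == "major":
        return f"{major + 1}.0"
    if change == "minor":
        return f"{major}.{minor + 1}"
    return version


v1 = {"properties": {"song_id": {"type": "string"}}, "required": ["song_id"]}
# Adds an optional property: existing data still fits.
v2 = {
    "properties": {"song_id": {"type": "string"}, "source": {"type": "string"}},
    "required": ["song_id"],
}
# Changes the type of an existing property: stored data no longer fits.
v3 = {
    "properties": {"song_id": {"type": "number"}, "source": {"type": "string"}},
    "required": ["song_id"],
}

version = bump("1.0", classify_change(v1, v2))
print(version)
version = bump(version, classify_change(v2, v3))
print(version)
```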

Kostas Pardalis 25:46

That makes total sense. Ondrej, I have a bit of a more technical question based on the stuff that you mentioned, especially about using JSON as the internal serialization for representing the schema. I’m old enough to have gone through many different technologies that have to do with representing the schema and structure of the data and the semantics around the data, starting from XML Schema, for example, which is something that is super expressive, right, even to things like ontologies and all that stuff. On the other hand, we have something like JSON, right, which actually was never intended for describing something like schema, and it’s very lightweight. And it doesn’t have the expressivity of the technologies on the other extreme. How limiting is this? In terms of capturing and monitoring and dealing with the schema of data, and trying to create this layer on top, which is more semantic about both the meaning of the data and also the structure of it, do you think that there’s something missing there? Is JSON enough? Or do you see that in the future, the industry will come up with new ways of implementing and describing the schema for data?

Ondrej Hrebicek 26:57

Yeah, it’s a good question, Kostas. As far as our use cases have been concerned so far, the JSON Schema spec has been phenomenal. And we haven’t come across any core definitions that we wouldn’t be able to represent in a JSON Schema format, specifically for analytics data. So we’ve been very happy with the standard, and I think we are using it to its full potential. There are a couple examples where we’ve had to think about extending the standard. And actually the Iglu standard, which is itself an extension to JSON Schema, comes to mind, and we use it as well, where the ability to specify some of the metadata around the schema is not always possible inside the JSON Schema format. Things like owners of the schema, or the internal name or the display name of the event, or other metadata, or the sources where the particular event is supposed to be captured from: there’s really no place to represent that in the JSON Schema document, and you need an extension. But as far as the core structure of the event is concerned, we’ve been able to get everything that we needed from the JSON Schema format.

Kostas Pardalis 28:04

That’s great. One last question for me, before I let Eric ask his own questions, we talked about the schema from a technical perspective so far. So that’s a question a little bit more of like, let’s say the business perspective of things inside their organization. How is the schema connected with my business goals, and how, let’s say, I translate these goals into the schema and track the data that’s going to be used in the future to do analyses, come up with KPIs and all that stuff. So how do you see from an organizational point of view, this connection happening inside the organization and you also as a company, because you are interested in order to communicate the value to your potential customers? How do you communicate this?

Patrick Thompson 28:53

Great question. So the way that that typically happens is pretty messy, actually, within most organizations, but quite often, it happens within more planning meetings, typically within the team. So imagine you’re working on a software team, and you’re releasing a new experiment, typically you would define both your macro and micro success metrics for that piece of work as part of the spec that you’re publishing, before any engineering work actually kicks off. As part of doing that, then you’ll break that work down into, you know, tickets that your engineering team might have, you’ll break that work down to kind of the micro level goals that you have for that work as well. And typically, it’s at that point where Iteratively comes into play, where you’re actually defining the actual representation of that schema inside of our tool. And linking those changes into kind of your product spec sheet whether or not that’s in Confluence or Notion or some other type of tool. Yeah, typically today, those two pieces happen separately.

Eric Dodds 29:49

Question and this is more practical, just thinking about being a user. So let’s say I’ve already had an analytics implementation, right. So I’ve, you know, implemented you know, whatever Segment or a direct implementation with Mixpanel or what have you. So could you just walk us through? Because when you talk about this stuff well, when we talk about specifically using Iteratively, it almost sounds like, is this kind of replacing the instrumentation that I’ve already done? How does that work? Could you just give us a really practical sort of technical run through if I already have, say, a Mixpanel implementation instrumented for analytics, how, from a technical standpoint, do I lay Iteratively on top of that, or integrate it? What does that look like?

Ondrej Hrebicek 30:36

Yeah, there’s two ways that we advise customers to deal with this, Eric. One is to just go in and yeah, take out the kind of ad hoc Mixpanel instrumentation that you have in your product and replace it gradually, progressively, like Patrick said, with Iteratively. So it’s what most of our companies do, they come in, they sign up for Iteratively, they figure out what their most important events are, the key events that they’re tracking about their customers, they define those in Iteratively, they create a new version of that tracking plan, and they ask the developers to migrate those events over to Iteratively and get the strong typing, get the CI checks, and the testability support for those events. And then they treat the rest of the events as technical debt that gets chipped at, over the next, you know, couple of weeks, couple of months, couple of years, really depends on the company. The second thing that we’re working on right now is the ability to effectively audit and inspect the instrumentation that you have in your product today through an SDK like Mixpanel. So we have an SDK of our own called Audit SDK, that will effectively hook into Mixpanel or Amplitude or Segment’s SDK and monitor the events that are being sent for those over to Iteratively, and compare them to the tracking plan. And if we spot issues, events that aren’t defined in the tracking plan, events that are being tracked differently, or with different property types, we would alert the tracking plan owner and let them know that hey, something is wrong. And usually the idea here is that the development team is going to get so fed up with all the problems, and all the issues that are being raised by either the PM team or the analytics team, that they’ll just go ahead and implement the Iteratively SDK in the product, or at least accelerate the implementation of the SDK so that they are in sync with the tracking plan, and these problems just don’t arise.
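
[Editor's illustration] The kind of check described here, comparing observed events against a tracking plan and flagging undefined events or mismatched property types, can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the Audit SDK itself: the plan format, the event names, and the `audit_event` function are all hypothetical.

```python
# Hypothetical tracking plan: event name -> {property name: (type, required)}.
TRACKING_PLAN = {
    "Song Played": {
        "song_id": (str, True),
        "duration_seconds": (float, False),
    },
}


def audit_event(name, properties, plan=TRACKING_PLAN):
    """Return a list of issues; an empty list means the event matches the plan."""
    if name not in plan:
        return [f"event '{name}' is not defined in the tracking plan"]

    issues = []
    expected = plan[name]

    # Check every planned property for presence and type.
    for prop, (prop_type, required) in expected.items():
        if prop not in properties:
            if required:
                issues.append(f"'{name}' is missing required property '{prop}'")
        elif not isinstance(properties[prop], prop_type):
            issues.append(
                f"'{name}.{prop}' should be {prop_type.__name__}, "
                f"got {type(properties[prop]).__name__}"
            )

    # Flag properties that are sent but not part of the plan.
    for prop in properties:
        if prop not in expected:
            issues.append(f"'{name}' has unexpected property '{prop}'")

    return issues


# An event that matches the plan produces no issues.
print(audit_event("Song Played", {"song_id": "abc", "duration_seconds": 12.5}))
# An event with a wrongly typed property is reported.
print(audit_event("Song Played", {"song_id": 42}))
```

In a real deployment the issue list would presumably feed an alert to the tracking plan owner rather than a print statement; the sketch only shows the comparison step.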

Eric Dodds 32:23

Sure that makes total sense. Yeah. It’s kind of the diagnostic approach, right? You may not know how sick you actually are. So you may, you know, you may be less willing to do surgery on the solution. That’s super interesting.

Ondrej Hrebicek 32:35

Definitely.

Eric Dodds 32:35

And you said you work with a lot of product teams, could you talk a little bit more about other teams you work with? Is that your main entry point? You know, in terms of people who are interested in using it? Or do you also sort of begin conversations with analysts and it sounds like the devs sort of get involved after an internal stakeholder who needs higher quality has raised the conversation?

Patrick Thompson 33:02

Yeah, our main entry point within most organizations is the data analyst or the analytics engineer within those teams, followed by, if it’s more of a, you know, the head of data or VP of analytics as well. Typically, they’re the ones that have the most pain related to, you know, data munging, or data quality. So they’re the ones who are actively looking for solutions. And then they typically introduce us to the product manager, which could be a data PM or the head of product within those organizations, and the engineering team for some kind of solution validation.

Eric Dodds 33:33

And one follow up question to that, and this is more of a, you know, Kostas and I work on product all day, every day as well. And so this is more of a selfish question. But I think it’s interesting that you have these various stakeholders. And even on your website, you sort of have, you know, different personas, who are stakeholders involved. How does that influence the way that you think about building the product and sort of feature prioritization? Since there are multiple people involved in the equation, I mean, even down to interface decisions, and other things like that, would just love to know about how you think through the product development process with a couple of different personas, who are ideally using the product.

Ondrej Hrebicek 33:38

Yeah, it’s really hard, Eric, it’s hard to. Honestly, I wish I had a better answer for this one. I think the decision that Patrick and I’ve made is that we have to … it’s kind of like air or water, we need both. So we’ve got to make the data analyst, the person who’s responsible to be kind of the gardener for your tracking plan, successful with the tool. So the interface that they are interacting with on a daily basis, the website, the support for, you know, for branching and kind of shepherding and curating the tracking plan must be top notch. And then the developers have to have a great experience as well. Right? They are key to a successful implementation of the analytics platform. You need both in the picture, and yeah, we’ve had to kind of struggle there a little bit and straddle both sides of the fence here to build a product that does support the organization as a whole.

Eric Dodds 35:06

Sure. And this may sound like an interesting question, but talking about stakeholders brought it to mind, I’d love to hear any stories you have around, you know, so let’s quickly talk about the personas that we just mentioned. So we have, you know, an analyst and a developer. And so the analyst has a major data quality problem, they find Iteratively, to get adoption, they work with the developer to implement it. I’d love to hear any stories you have about how that’s impacted the organization beyond just those stakeholders, because they sort of see the, you know, the sharpest end of that, right, in that they’re solving a problem that’s making their job really difficult, every single day. But when you get two or three layers removed from that in an organization, say, you know, you go to, you know, a VP or even someone on the executive team, have there been any stories where the work around data quality has sort of reverberated around the organization at a pretty wide scale?

Patrick Thompson 36:04

Yeah, definitely. I think the two main drivers that we typically see, as far as uptick when Iteratively is introduced for an organization, one is a lot of the data quality issues, purely from a human error perspective, go away, which is, you know, easy enough for these organizations to quantify by the number of data bugs that are being raised in Jira. And the second one is kind of speed of instrumentation. So there’s a lot of time savings involved when you have a tool that can really manage this process and cut down on all the kind of back and forth between stakeholders within this organization. Typically we see that for a lot of folks who have Iteratively in place and this process adopted, it is a lot easier for them to get their analytics tracking code shipped to production. At a higher level, you know, we think about the reporting aspect within most of these organizations, we tend to get really good qualitative feedback from, like, our VPs of data that, like, hey, PMs are starting to take ownership of tracking and starting to get integrated more into our release process. We have PMs who want to actually look at the data when they’re shipping features. So we tend to see this more organizational change. It’s harder to quantify when it comes to kind of the business value that’s being derived from that. But yeah, definitely more of an organizational change, when folks are actually planning and thinking about tracking upfront, you know, before the feature actually ships. Quite honestly, most organizations tend to ship work and think about analytics as an afterthought, or their CEO comes down and asks, hey, how did that release go, and you have to kind of scramble to get some data, which isn’t a great way of working.

Ondrej Hrebicek 37:34

But what’s been really interesting, Eric, as well, and this was surprising to me in the early days, is just how much of the organization actually cares about analytics data beyond the PMs and the analysts directly. So not going through the analytics or through the data team, the analytics data that folks captured doesn’t just get sent to Mixpanel, or Amplitude, right? You know this firsthand, it gets sent to many other tools these days as well, marketing automation tools, sales automation tools, customer success tools, as well. And it’s those directors and heads of those departments that are actually raising their hand and pushing for quality as well, so that they can rely on this data for doing their job well. And that used to, I think, not be quite the case in the past, but it’s coming up more and more often. And when we look at the customers that we’ve probably made most successful, it was the rest of the org that supported the Iteratively deployment and was excited about the possibility of actually being able to influence the tracking plan and have a say in what’s being captured in a collaborative way, and then actually be able to count on high quality data to hook up to email campaigns, for example, and Drift campaigns and things like that.

Eric Dodds 38:49

Sure. Yeah, it is really interesting. The one theme, and Patrick, you may have mentioned this term, specifically, so forgive me, if I’m repeating it, but one thing that both of you have talked about that’s a byproduct of having accurate data is velocity, which is really interesting. You don’t necessarily, when you think about the word trust, like I trust my data, I trust my, you know, data team, that they’re giving me the right information, you don’t necessarily immediately think, oh, that translates to moving faster, right. But if you don’t have to think twice about making decisions on the data, that’s a really, really big deal for a company. And this is funny, just thinking about, you know, sort of the classic startup wisdom, everyone talks about moving really fast, but you don’t necessarily think about trust in data as a major lever you can pull in order to increase velocity. So it’s really neat to hear that that’s sort of a major consequence of Iteratively being part of an organization.

Patrick Thompson 39:52

It definitely goes two-fold as well, right, like, I think back to my days on the growth team at Atlassian, where we’d ship experiments only to find out six weeks later that we actually forgot to have all the tracking in place. And we weren’t able to analyze the success of those features. And having to relaunch the experiment means that that was six weeks of missed learning opportunity for the business right? At the end of the day, like those who win in the market are going to be those who are best enabled to kind of have this high-velocity model of being able to take these learnings and actually derive insights and influence decisions. So analytics and data trust are huge factors of that at the end of the day.

Kostas Pardalis 40:33

So we are close to the end of our conversation, which is super interesting, by the way. And I have one final question for you. We keep talking about trust around data, and how important it is. And of course, how important is the role of productivity for increasing the trust in data? Based on your experience so far, because okay, you’re, I would assume you’re an expert right now, in terms of trusting data, what is missing right now to increase even further the trust that we can have on the data that we are using in an organization? That’s one question. And as a continuation to this question, what is next for you? Like? What are your plans? How are you going to keep delivering value? And what is exciting coming in the next couple of months from you?

Patrick Thompson 41:24

I can take the data trust one, and I’ll hand off the last one to Ondrej. I mean, specifically, when it comes to data trust for us, like we think of this as very much like how do we get the entire organization bought into this way of working, where you think of analytics as integrated into your SDLC. Generally speaking, we want to make it so that analytics are less of an afterthought. So for us, it means living where customers are living, in the tools that they’re living in. So like building out great collaboration and workflows with tools like Slack and Jira and other types of solutions that teams are operating in. And then making it really easy for people to have a good understanding of the data and how this data is being used and consumed within the organization, being able to tie in your tracking plan into potentially the report that’s being consumed in a tool like Amplitude, or building out more of an integration with something like dbt, or even directly linking into your reporting tools, like Looker. All this stuff is very much for us trying to create this, you know, single source of truth around your analytics schema and the ontology as it evolves over time. Ondrej, anything you’d add there?

Ondrej Hrebicek 42:28

Yeah, Kostas, I would add a couple things to what Patrick said. I mean, we want to build the best collaboration tool for tracking plans. At the end of the day that’s our big mission. And we started with making the experiences as nice as possible for the governors, we call them gardeners sometimes, and the developers in order to improve analytics health overall inside an organization. I think what’s next for us is doubling down on that. But also, you know, adding more integrations both on the destination side, the analytics destination side, as well as the schema sync side. And then adding some analytics health and monitoring for products that have had the SDK instrumented in them as well so that we can give folks a more holistic view of not just what’s implemented, but is it actually doing the right thing in production? Is it actually working the way that they had expected when they created the tracking plan? That’s a big thing for us going forward.

Kostas Pardalis 43:19

Ondrej, Patrick, thank you so much for the amazing conversation today. I hope you enjoyed it as much as Eric and I did. We’ve learned many interesting things about data quality and schema management. I think it’s a very important part of every data stack that is built out there. And I’m pretty sure that there are very exciting times ahead of us, especially for you and your company. So let’s speak in a couple of months again, and see what’s new, what has changed and keep up the discussion. I’m pretty sure we’ll have even more things that you can tell us about.

Patrick Thompson 43:54

Definitely. Kostas, Eric, thanks so much for having us.

Eric Dodds 44:03

Well, that was a really interesting conversation. I think one thing that stuck out to me was a common trend that we’re seeing, and that is the concept of trust as it relates to data within an organization. And we’ve just heard different ways that trust impacts an organization. And it was really interesting to hear the guys from Iteratively talk about the way that they see, you know, trust around data impact various teams, and then the organization as a whole. So just that theme, since we’ve heard it multiple times, really stuck out to me. What jumped out to you, Kostas?

Kostas Pardalis 44:39

Yeah, absolutely, Eric. Data governance is all about trust. And in order to put the right trust there in your data, you need many different things to happen at the same time. One is access control, as we said at the beginning, but data quality is also super, super important. Having the right mechanisms there to understand when we can trust and when we cannot trust the data anymore, and as a second step, to figure out what went wrong and how we can fix it, I think is of paramount importance for defining and implementing really robust data supply chains in an organization. In terms of using Iteratively to solve at least part of this problem, it was very interesting for me to hear how they do it, the technologies that they use, how they build that on top of, like, JSON, and how they perceive, actually, and deliver this product for productivity for the whole organization, but for the developers also. So we have all these SDKs that are strongly typed, where they can help the developers avoid many bugs and going back and forth, fixing the problems. And at the end, if we fix that, we can also end up having a very robust quality assurance mechanism inside the organization. So yeah, it’s still quite early, right. I’m pretty sure that in a couple of months, if we talk to them again, they will have even more exciting things to share with us.

Eric Dodds 46:12

Absolutely. Well, we will catch you next time on The Data Stack Show. Subscribe to get our weekly episodes in your podcast feed.
