Episode 115:

What Is Production Grade Data? Featuring Ashwin Kamath of Spectre

November 30, 2022

This week on The Data Stack Show, Eric and Kostas chat with Ashwin Kamath, Founder and CEO at Spectre. During the episode, Ashwin discusses data quality, monitoring alternative data in the finance industry, the complexities of managing the accuracy and security of that data, and more.

Notes:

Highlights from this week’s conversation include:

  • Ashwin’s background in the data space (2:43)
  • The unique nature of working with data in finance (7:32)
  • Technological challenges of working in the finance data space (13:55)
  • The third-party data factor and judging if it is reliable enough (17:07)
  • What made Ashwin decide to go out and build his own company? (31:47)
  • Defining data decay and data storing and why both are important (37:52)
  • Advice on the importance of data quality (42:10)
  • Final takeaways and wrap-up (50:49)

 

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 00:03
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.

Welcome to The Data Stack Show, Kostas. We love talking to data professionals who work in industries where they have certain requirements around the data, and Ashwin from Spectre has worked in the finance industry for a really long time, at multiple different types of companies, from consumer lending to a hedge fund, and now he's started his own company. And needless to say, people who have done that are generally extremely intelligent, so I know it's gonna be a good conversation. He actually worked for a company called Affirm, which was sort of the first big player in financing purchases online and getting really rapid approvals, if you will, for items that are not like buying a house; you're buying a computer or something like that, or even stuff that's not that expensive. I'm really interested to ask him about that a little bit, because I kind of remember when Affirm started showing up on all these websites and you could finance these purchases for smaller amounts. So I'm going to entertain myself and ask him one or two questions about that to satisfy my curiosity. But of course, it's about Spectre. So what are you interested in asking him about Spectre?

Kostas Pardalis 01:43
First, I will say, I'm going to ask Ashwin to share some of his knowledge about how data is used in finance. I'd like to hear about the unique challenges of working with data in the finance sector. I mean, it's a heavily data-driven sector, we would say, but it comes with its own unique challenges. So I will start from there and then talk with him about how he decided to build Spectre and what the platform is. Right? So let's do that.

Eric Dodds 02:20
Yeah. I may make good on stealing your question about the finance stuff, so I apologize in advance. All right, let's dig in and talk with Ashwin. Ashwin, welcome to The Data Stack Show. We are so excited to chat with you and learn from you.

Ashwin Kamath 02:37
Great to be here, guys. It's very nice to meet all of you.

Eric Dodds 02:43
Okay, give us your background. You've spent a ton of time in finance, so give us your story, but also how you got into data in the first place.

Ashwin Kamath 02:53
Yeah, so my name is Ashwin. I am the CEO and founder of a data platform company called Spectre, which I started about a year ago. And I've been in the data space for close to a decade now. I used to work at a fintech company out in San Francisco called Affirm, a buy now, pay later company, where I dealt with data both on the underwriting side, building models to figure out whether or not someone is creditworthy, and on the fraud side, whether they are who they say they are, as well as on the back office side with reporting and funding of the loan portfolio. And then in 2018, I moved out to New York, where I'm currently based, to join a quantitative hedge fund called Two Sigma, where I worked on the alternative data portfolio: basically bringing in enormous amounts of data from external third-party sources and putting that to use within the trading engines, everything end to end, from cleaning up data and standardization, to building the underlying data infrastructure to make sure all of this is working and flowing, preparing the data for research purposes, taking final research and analysis and putting that into the system, making sure it's being computed on an ongoing basis, and finally layering all of this with a data quality system that makes sure that the data, as it flows between different stages of the pipeline, is in a good and healthy state for the trading systems.

Eric Dodds 04:25
Wow. So deep end-to-end experience across the entire pipeline. You've done so much in finance, and I want to ask about the specific nature of working with data in finance. But this is just a personal curiosity first. I remember when Affirm started showing up on websites, a mountain bike, for example, right? You're browsing a mountain bike website, and all of a sudden you can finance this purchase with Affirm. One thing that was really interesting, and correct me if I'm wrong, is that Affirm was able to qualify people really fast, right? How did you approach that problem? Because as a user, that's amazing: I'm about to buy this thing, and it's not like I'm buying a house, but it's enough that I want to finance it, and I can get approved for a loan really fast. But from an infrastructure perspective, being in the industry, that's heavy duty. How did you approach doing that? Because you were doing it pretty early, I think.

Ashwin Kamath 05:32
Yeah, you see this a lot in data systems and machine learning systems, especially in today's day and age, where there's a lot of crunching and data processing that happens in a more offline setting to create and train these models. When used in an online setting, they basically get a feed of features from whatever behaviors the user has already displayed at the time of that decision being made. And so the model itself, when it runs, can actually produce a result in under a second. However, the computation that is happening within that one second is taking into account tons and tons of data that's been crunched in a more offline setting and has already been prepared for the online side.
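
To make that offline/online split concrete, here is a minimal sketch in Python. It is purely illustrative: the feature names, weights, and in-memory feature store are invented, and it is not Affirm's actual system.

```python
# Illustrative only: feature names, weights, and the feature store are invented.

# Offline: a nightly batch job has already crunched historical data into per-user features.
OFFLINE_FEATURES = {
    "user_123": {"credit_utilization": 0.42, "on_time_payments": 17.0},
}

# Stand-in for whatever model was trained offline; here just a toy linear score.
WEIGHTS = {"credit_utilization": 0.6, "on_time_payments": -0.02, "pasted_into_form": 0.3}

def risk_score(user_id: str, session: dict) -> float:
    """Online path: a dictionary lookup plus a dot product, so it returns in milliseconds."""
    features = dict(OFFLINE_FEATURES.get(user_id, {}))
    # Cheap behavioral signals captured during the checkout session itself.
    features["pasted_into_form"] = float(session.get("pasted_into_form", False))
    return sum(WEIGHTS.get(name, 0.0) * value for name, value in features.items())

# Approve the purchase only when the precomputed plus session features imply low risk.
approved = risk_score("user_123", {"pasted_into_form": True}) < 0.5
```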

Eric Dodds 06:24
Super interesting. So you basically process all these features ahead of time, and you're just completing the model with known inputs; the known inputs carry the last features that let the final mile be computed, right?

Ashwin Kamath 06:39
Correct. And when it comes to the specific features, you wouldn't even believe some of the features that are being utilized. These are things like: what kind of website did you come to the site from? How are you filling in the form? Are you copy-pasting or not?

Kostas Pardalis 06:56
Really? No way.

Ashwin Kamath 06:58
There's a lot that can be told, from a fraud perspective, about who this person is just by the behaviors they display when interacting with a website.

Eric Dodds 07:10
That is so interesting, because to me, those are like marketing and user experience data points, right? How someone interacts with the site. But you're actually using those as features to check for fraud. That is so fascinating. Okay, well, I'm not gonna go down that rabbit hole, because we have too much to talk about. Tell us about the unique nature of working with data in finance. You did it at Affirm, then you were managing these huge pipelines of sort of non-financial data at a hedge fund, and Spectre works with a lot of financial firms. So give us the landscape of working with data in the finance industry, or fintech.

Ashwin Kamath 07:53
Yeah, I think the biggest thing that I have seen with data in finance is how important data quality is. Because the nature of the decisions being made in this industry, in this sector, is very high stakes, and each decision can have meaningful impact: a trade going out, whether that's going to be a long or a short, or an underwriting decision being made, whether or not I'm going to give money to someone. It is extremely important that the data being fed into these models, being used to make these decisions, is in a good quality state. And so what we started to see is that the topology of the data, how the data networks and pipelines are configured, so to speak, will look pretty similar to other industries. But the data quality side of things is usually approached as a first-class principle, rather than something you layer on top after the fact on a kind of best-efforts, hopeful basis, so to speak. Yes.

Eric Dodds 09:07
So just to make that a little bit more explicit, I'm just thinking of an example here. In a non-financial industry, let's say we have a consumer mobile app or something, and you don't make a good recommendation, so the person doesn't add an additional thing to the cart at checkout, right? Which is unfortunate and may affect a certain subset of users. But if you make a bad loan, you're upside down financially, and it doesn't take very many of those to significantly skew the bottom line. Is that kind of what you're getting at in terms of the critical nature of the quality?

Ashwin Kamath 09:48
Exactly, exactly. Or even in a trading setting, a simple example: we're pulling in data from some external source, and over the last week the data hasn't updated, and we don't have good data quality monitoring to notice the issue. That data continues to flow into the final trading system, the trading system gets a forecast saying there's been no change in the company's forecast, and so it starts shorting the stock. Right? The stakes are high. This particular case is a seemingly easy problem to catch. But when you take the infinite variety of data quality issues that could occur, which are pretty difficult to predict in and of themselves, it's actually a much more difficult problem to make sure that everything is working correctly, even when no one's looking at the data all the time.
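
A failure like that can be caught with a very small freshness gate before the forecast ever reaches the trading system. Here is a minimal sketch, assuming the feed lands in a DataFrame with a naive UTC "as_of" timestamp column (both assumptions, not a description of any real system):

```python
from datetime import datetime, timedelta

import pandas as pd

MAX_STALENESS = timedelta(days=2)  # tolerance before the feed is considered stale

def is_fresh(feed: pd.DataFrame) -> bool:
    """Return False when the external feed has silently stopped updating."""
    latest = pd.to_datetime(feed["as_of"]).max().to_pydatetime()
    return datetime.utcnow() - latest <= MAX_STALENESS

# Only hand the data to the forecasting/trading step when the gate passes, e.g.:
# if is_fresh(vendor_feed):
#     publish_forecast(vendor_feed)
```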

Eric Dodds 10:44
Yeah. Well, let's dig into that a little bit. You talked about alternative data, which, my understanding is, is sort of inputs of a large variety that are not necessarily directly related to the trading price of a particular stock, or for stocks in general, right? So it's not trading data from the actual exchange itself; it's inputs from outside of that that may influence it. Can you give us some examples of what those things would be? And the other thing I'd like to know is the breadth of those sources: how many are there? How many do you include when you're modeling? How do you even approach that decision?

Ashwin Kamath 11:26
Yeah, there is a ton of data, there's actually a ton of alternative data. And there's a whole segment of alternative data called open source intelligence, or open source alternative data, which is publicly available: you're thinking web behavioral data or scraped data from different types of websites. There is so much that can be told about the state of a business just from their online digital presence. If I look at the trend of job postings from a specific company, who are they hiring, who are they keeping around? LinkedIn data is an asset, right? What are the trends of job positions being held at different companies versus their competitors? Foot traffic data is another big one: where are people going and moving? Credit card transaction data from banks; generally all of these are anonymized in nature, so we're not really taking it from the perspective of personally identifying information. We're trying to look at this at a more holistic, somewhat macro, somewhat micro scale of how that data fits in to model the overall economic environment that different businesses play in.

Eric Dodds 12:53
Super interesting. And just from the sound of it, my guess is those are really large datasets.

Ashwin Kamath 13:00
Yes, yep, you can sometimes look at terabytes and terabytes of data, especially when it becomes important to look at the historical nature of that data and how things evolved over time. It's very important to have a large enough history that you can see those trends as they take shape and as they form, so you can start to say: okay, here's what we're seeing today, versus here's what we saw a quarter or two ago, versus here's what we saw two years ago, and this is what we can predict about the next quarter. Right? And that helps make those decisions as they play out.

Eric Dodds 13:39
Absolutely. Okay, well, one last question for me, and I'm gonna tee this up as a lead-in for you, Kostas, because I'm gonna let you have dessert and ask all about the product. I want to do that too, but I've been hogging the mic. What were some of the big problems you faced, especially thinking about the hedge fund and all the alternative data inputs, from a technological perspective? So we're talking about terabytes of data. We're talking

Kostas Pardalis 14:08
about losing huge amounts

Eric Dodds 14:12
of money if a simple thing like data freshness falls behind. What were the issues you faced, and how did you try to solve those?

Ashwin Kamath 14:23
Yeah, I think the number one issue was the handoff between a development environment and a production environment being quite slow. And this is pretty agnostic to the hedge fund space; I think we see this across every other industry. The idea is that a lot has gone into making it really quick to start exploring data and building analysis on top of it. Usually you see this done in some sort of Jupyter Notebook environment, some sort of local environment. And then when it comes to actually productionizing that analysis, that data pipeline, everything kind of falls flat. There aren't really any standards here; every company is doing their own thing. The infrastructure layer looks completely different from one company to the next: some companies are using Docker containers, other companies are just putting scripts onto servers and running them in a local conda environment on that server, and no one knows what's in that conda environment. It's a complete mess. Then you take it one step further and say, okay, now we also want to make sure that the quality of the data being output by my data pipeline remains consistent over time, and if I make changes to my data pipeline I want to know when something might go wrong at the data layer itself. Now the ballgame is even more difficult to deal with, right? You're talking about monitoring data, which itself is a recurring process that needs to run, look at the data, and absorb that data over time, and then almost apply another type of machine learning anomaly detector on top of the outputs, or the metrics being computed about that data, to make sure the data is staying consistent. And that's part of the challenge with data science and data engineering: how do you get an infrastructure layer that does a lot of this for you, without having to spend an inordinate amount of effort just on the infrastructure component, and allow you to focus more on what the business logic looks like? Yeah.
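
The "monitor the data, then monitor the metrics about the data" idea can be sketched roughly like this (the column names and metrics are assumed for illustration): every pipeline run appends a row of cheap metrics, and a simple statistical test compares the newest run against its own history.

```python
import pandas as pd

def collect_metrics(df: pd.DataFrame) -> dict:
    """Cheap per-run metrics about the output data; assumes 'amount' and 'customer_id' columns."""
    return {
        "row_count": float(len(df)),
        "null_rate_amount": df["amount"].isna().mean(),
        "distinct_customers": float(df["customer_id"].nunique()),
    }

def is_anomalous(history: pd.DataFrame, latest: dict, z_threshold: float = 3.0) -> bool:
    """Flag the latest run if any metric sits far outside its own historical distribution."""
    for name, value in latest.items():
        mean, std = history[name].mean(), history[name].std()
        if std > 0 and abs(value - mean) / std > z_threshold:
            return True
    return False

# Each run: metrics = collect_metrics(todays_output)
#           alert = is_anomalous(metric_history, metrics)
#           metric_history = pd.concat([metric_history, pd.DataFrame([metrics])])
```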

Eric Dodds 16:35
Yeah. I mean, because what you're describing, you have data science and data engineering, but a lot of what you're describing is actually more DevOps- and SRE-flavored

Kostas Pardalis 16:46
work, right, where the

Eric Dodds 16:48
uptime and monitoring and alerting and responses come in. Okay, that's super interesting. Kostas,

Ashwin Kamath 16:55
I'd say operationalizing data science is really an engineering problem, yeah. I don't think the world has realized that yet.

Kostas Pardalis 17:04
Absolutely. So, I've been super curious to hear from you about third-party data. In most cases when we talk with people, they're mostly struggling to collect their own data, right? It's the data that your company generates in one way or another, and you're trying to make sure that you don't miss anything and give access to everyone inside the organization to work with it. But you mentioned third-party data, and I don't know much about it, so I'd love to hear from you. First of all, how do you go shopping for third-party data? How does this even work? Is it like going to Amazon and saying, okay, I'm looking for, I don't know, two pounds of data with these characteristics, please? So can you tell us a little bit about the whole lifecycle of getting third-party data and incorporating it into the product you're building, especially when it comes to going out there to find this data, procure the data, maintain it, and all these things?

Ashwin Kamath 18:28
Yeah, it's a pretty laborious process, but it does follow the same steps you would imagine from an e-commerce purchase or procurement process, with a few caveats in between: making sure the data meets the compliance requirements of your company, and making sure that you can evaluate the data in a way that lets you see it's useful to your company, without getting the data for free. That's the biggest challenge: there is this skewed incentive between the buyer and seller of data, where the seller says, hey, I want to let you try this data without you actually using it for a real decision process. But let's go through the whole process from start to finish. First, you have some use case at hand for which you're looking for third-party data, and there are several ways to go about finding it. The most obvious is to Google it. If I'm looking for LinkedIn data because I'm prospecting for a marketing purpose, and I want all US-based companies that have chief financial officers listed, the best source for that would be something like LinkedIn. Finding data for that purpose generally involves looking through data catalogs and data marketplaces that essentially have a bunch of metadata about each of these datasets, giving you enough information that, at a high level, you can say it meets the criteria for what you're looking for. You reach out to the vendor, you initiate a conversation. Generally this looks very similar to any type of B2B sales process, where you go through some evaluation of that data. There's typically no demo in the process, because data is a very abstract kind of product, so the demo phase actually looks like you providing some requirements around what you're trying to do, and some sample data being provided back to you that is evaluated in and of itself. Within the hedge fund world, usually you'll look at some amount of historical data as well, so you can test it for back-testing purposes. If that meets the criteria, then you go into the negotiation side of things and discuss the unit price for how much data you're looking for; generally, with more volume you get a better unit price per record of data. And there are a lot of levers you have to think through. The first is: what kind of sample do I need? What kind of coverage do I need in terms of geography, in terms of sectors and industries, depending on the specific data at hand? Second, you need to think through how often that data needs to be refreshed or updated. The world is constantly changing, the data changes with it, and making sure that the refresh rate meets the criteria you're looking for is extremely important. Third, how are you going to access that data? Is this going to be push-based access, where the data vendor pushes data to, say, an S3 bucket and you pull it out of there? Or is this going to be pull-based access, where I'm pulling it out of an API and figuring out on my own how and when to store it? This all gets written into a contract, and once the contract is signed, usually after you go through a certain amount of compliance auditing as well to make sure the data was collected in a way that meets your business's requirements, you get access to the data from there.

Kostas Pardalis 22:42
Okay, and how do you judge if the data is good enough for your needs? You said they give you a sample of the data, right, but is it like some summary statistics that are provided, for example? How can you formalize this process, if it can be formalized? Because they don't want to reveal the complete data set, usually; they don't want to do that. So how do you go through that?

Ashwin Kamath 23:11
Yeah, aggregate statistics help a lot. Being able to understand, for a given column in the dataset and the specific segment you're looking at, let's say US only (this might be a global dataset, but you only care about the US segment): out of the population of US data points, how many null values are in this and the other columns, and what are the columns. That's a pretty easy way to get a sense of the completeness of the data for what you care about. When it comes to the evaluation itself, generally you're going to want to put that data through a process similar to how you plan to use it in a live setting, when you'll actually have the real data at hand, and test it from a statistical point of view: does this meet your needs on the predictive side of things? Or, if you're collecting data for fraud or underwriting purposes, does the richness of the data coming in seem correct? If you're looking at data about people, say scraped from LinkedIn, you might want to cross-check some of the entries against LinkedIn. It's a manual process, but it can lend some confidence to how you're processing that data.
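
As a rough illustration of that aggregate-statistics pass over a vendor sample (the file name, the "country" column, and the US filter are all assumptions made for the example):

```python
import pandas as pd

sample = pd.read_csv("vendor_sample.csv")   # the sample file the vendor sent back
us = sample[sample["country"] == "US"]      # the segment you actually care about

profile = pd.DataFrame({
    "null_fraction": us.isna().mean(),      # completeness per column
    "distinct_values": us.nunique(),        # rough sense of richness / cardinality
})

print(f"US rows: {len(us)} of {len(sample)} total")
print(profile.sort_values("null_fraction", ascending=False))
```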

Kostas Pardalis 24:40
Yeah, that makes a lot of sense, actually. It reminds me a little bit of problems I've had, and I'm not the only one, when building things like query engines or databases. You have these systems in production and then you need to debug the bugs, right? And it's like, okay, to reproduce, let's say, the query, first of all I need to know the data and the statistics around the data, and it's unbelievably hard to do that, because getting access to and taking a look at that information is extremely proprietary for a company. It's like, yeah, take a look at my database and see exactly what kind of information I keep here for my users: it cannot happen. And it also feels like this stuff changes too fast, so it's even harder to do, let's say, regression testing using some baseline queries and datasets. It's hard to pin down these requirements, and here we're talking about a deterministic system, like software. We're not even talking about training models, where we don't know exactly what is inside the model; it's more of a black box. So it's a very fascinating area, and a very hard problem, but I think people don't really realize it.

Ashwin Kamath 26:17
You see this challenge especially, going back to the skewed incentives, whenever you go through one of these data evaluation processes. It's very common to get kind of a golden set of data from the vendor, which is the best segment of data they can offer, so you can see how great and how powerful the data is. Then, when you get your hands on the real data, after having signed, say, a one- or two-year contract, you realize that, hey, the rest of this dataset is not nearly as good as the sample that was supplied. Yeah. What's even worse is when you start to build stuff on top of that data, and you don't have the monitoring in place to watch for things like data decay, data scoring getting out of whack, and that sort of thing. Suddenly there's a larger-than-average number of outliers appearing in the data, and it's very easy for something that worked in the first six months of releasing a new model or a new data pipeline to suddenly start behaving very poorly over time. That's why it's extremely important with third-party data especially: because you are not the source of that data, you don't know what's happening to it from its true source until it gets to you. You only know what happens from the point it gets to you onwards. It's important to put those tests in place, put those guardrails in place, to make sure the data conforms to and stays consistent with the assumptions you made when you first started developing against it.

Kostas Pardalis 27:59
Yeah, 100%. I have a question. It's about third-party data again, but about something that happens a couple of steps earlier. So, you're building a model and you're trying to achieve something, you have an objective there; let's say you're trying to do some scoring or predict behavior, right? And you usually start by having some data. It's a bit of a chicken-and-egg problem: you have some observations and you try to model something based on these observations, right? How do you reach the point where you say, I need to procure data? And how do you know what kind of data to go and look for out there? Because it's one thing to say, okay, this is the data I can get from my company, because it's clickstream data, these are the sources where I capture data, blah blah blah; that's much more straightforward in my mind than having the whole space of different options around the data you could use. But procuring data means you have, I don't know, an open space; there are things that you don't even know exist, right? So, from a model training and building point of view, how do you identify the data that you need and reach the point where you say, I don't have this data first-party, I'll go out there and try to procure datasets?

Ashwin Kamath 29:34
Yeah, I think getting educated about the different possible data segments out there is probably the first step, and I think it's going to become much more common for data scientists to just be more aware of what's out there. Third-party data is just coming to light. I would say five years ago, the only real buyers of third-party data were in the hedge fund industry, but over time it's been adopted by several other industries. I see it a lot more commonly used within the marketing space for prospecting and lead generation, being able to use what we call intent data to understand that someone just visited a specific site, and maybe that's a competitor's site, and maybe that shows interest in them being a buyer of that product. That's a good candidate for me to run an email campaign or an advertising campaign against. And so I think we've started to see a bit more of people just being more educated about what types of data there are. I don't know that I have a great sense of when is the right time to start thinking about it. Usually, what I see is that people start to adopt third-party data either very early on, when they're building and training models, as a way to bootstrap that initial data segment. So instead of saying, okay, once I collect 1,000 observations I can build my model, I say, okay, if I just buy 1,000 observations I can create my model, and then I keep filling that in with more first-party data as I collect it. The second is a more supplementary purpose, where I say, okay, I have this first-party data stream coming in, and it would be really good to know this other information about these users based on what I can find from their digital presence. Being able to feed that in as an additional data source and keep augmenting that internal first-party data stash with third-party data is another approach that I've seen be successful.

Kostas Pardalis 31:45
That's super interesting. All right. So you obviously had a very interesting and exciting career in the financial sector, and you ended up building a company and a product. Tell us a little bit about that. What made you decide to do it? What kinds of problems did you see out there that you thought, oh, that's worth pursuing as a business, enough to leave my career, my safety over there, my comfort zone, where I know what can happen or not, and go build a company and a product? Tell us a little bit more about that.

Ashwin Kamath 32:31
Yeah, so I think the biggest motivator for me was seeing the sophistication of technology at these more established companies, and understanding that the data industry is going to continue to grow at the incredible pace that it is. But when it comes to an understanding of how to handle data in production settings, there has been what I believe to be a pretty big lack of innovation. Every company that I see is doing their own thing: they generally all start with something like Apache Airflow, where they run their data pipelines, then they build their own kind of data quality stack on the side, and eventually they upgrade to something else, and that something else tends to be completely different from one company to the next. There's always a tremendous amount of skilled data engineering support required to deliver on that, especially at the infrastructure layer. And so that was the biggest driver for me: being able to say, okay, there is a way to generalize some of this technology, to basically create an out-of-the-box data infrastructure layer that makes it really simple to go from development to production, with a system that actually helps you do it, rather than you having this inordinate burden to configure things in exactly the right way so that everything works correctly. When it comes down to the problems we're really looking to solve, we say: okay, on the exploratory side, the development of data pipelines, machine learning pipelines, and machine learning models, there's a tremendous amount of tooling that already exists that solves those problems, and it's going to continue to improve. But we want to focus on what it means to take that and put it into a system, so that when the data scientist decides to move on to the next project, they can come back six months later and know that their initial project is still working and running appropriately, the way they expected when they launched it. There are a few major problems we see. The first is around the DevOps side of things. When I have data pipelines running in a local environment, how do I push them into a production setting? How do I make sure they're running on servers, whether on some sort of schedule or whenever the data itself is updating? How do I make sure dependencies are tightly managed based on how data flows from one step to the other, based on intermediate data inputs and outputs between each of these stages? And then finally, how do I tie this back to data quality in a way that guarantees that if data quality issues occur somewhere in the middle of the pipeline, that data doesn't continue to spread and contaminate downstream analysis? There's a pretty good analogy I've come up with: it's kind of like the way the manufacturing industry thinks about the assembly line. When you think about why the assembly line exists, a lot of it comes down to this idea of being able to install quality control points, or nodes, between the different stages of the line. And the reason factories are designed this way is that recalls are extremely expensive, both reputationally and logistically: bringing back all the items, remaking them. The same kind of thing exists in the data world.
If I push out a data report, and that goes out to, say, my CEO, and they make a decision off of that, and it turns out it was made off of incorrect data, now the reputation of my data team is at risk, but also, logistically, I have to go restate all the data that went into making that report and republish the report.

Kostas Pardalis 36:37
Right, yeah. So, how would you describe Spectre as a platform? Would you say it's a DataOps platform, an ETL platform? How would you call it?

Ashwin Kamath 36:54
Yeah, I would say a data operations platform is the closest way to describe it. But we think of things in four layers. There's the storage layer, which we don't really handle but integrate with: your Snowflake, BigQuery, your data lakes, etc. Then you have your compute layer, which is data moving from one storage area to the next, usually with some transformation happening in transit; this is your ETL, your compute stack. Then you have your data quality layer, which reads the data to make sure it's in a good, healthy state. And then finally we have the control layer, which is the brain of the system: it makes sure that as data goes from one step to the next, it takes into account what's happening on the data quality side, so that a data pipeline doesn't actually run if its sources and inputs are in a bad or unhealthy state.
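
A toy sketch of that control-layer idea, under assumed names (this is not Spectre's implementation): a compute step is only triggered when every upstream input is known to be healthy.

```python
from enum import Enum

class Health(Enum):
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"
    UNKNOWN = "unknown"

def should_run(step_inputs: list[str], status: dict[str, Health]) -> bool:
    """The control layer consults the data quality layer before triggering compute."""
    return all(status.get(name) == Health.HEALTHY for name in step_inputs)

# Example: an unknown input blocks the pipeline just like an unhealthy one.
status = {"raw_transactions": Health.HEALTHY, "fx_rates": Health.UNKNOWN}
print(should_run(["raw_transactions", "fx_rates"], status))  # False
```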

Kostas Pardalis 37:53
Right, yeah. So you mentioned two interesting terms a little bit earlier; you said something about data decay and data scoring. Tell us a little bit more about these terms. I'm pretty sure they have to do with quality, obviously, but I'm very curious to learn more about the semantics of these terms and how they are represented in the platform.

Ashwin Kamath 38:24
Yeah, so data decay is basically the idea that, over time, data stops producing the same kind of predictive value that it did when you first developed against it. Being able to catch that issue in an unsupervised fashion is part of what our platform helps do. Basically, the outputs of data pipelines are automatically monitored to detect statistically significant changes in the data across the main dimensions of data quality: volume, freshness, anomalies within the data, data distributions, cardinality, nullness, etc. Without going into the semantics of the specific dimensions, being able to spot those issues in a way that doesn't require you to program rules about how your data is going to change is actually a very, very powerful concept. It allows the data scientist to work on the business logic of how their data is being transformed and focus on the outputs and results, while the system detects when something is off because a statistically significant issue has arisen.
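
One way to picture the dimensions listed above is a per-run quality profile that a monitor could then track over time, rather than relying on hand-written rules. A small sketch (the timestamp column name is an assumption, and this is only an illustration, not the platform's method):

```python
import pandas as pd

def quality_profile(df: pd.DataFrame, ts_col: str = "event_time") -> dict:
    """Snapshot of volume, freshness, nullness, cardinality, and a crude distribution summary."""
    numeric = df.select_dtypes("number")
    return {
        "volume": len(df),
        "freshness": pd.to_datetime(df[ts_col]).max(),
        "null_fraction": df.isna().mean().to_dict(),
        "cardinality": df.nunique().to_dict(),
        "numeric_mean": numeric.mean().to_dict(),
        "numeric_std": numeric.std().to_dict(),
    }

# A monitoring system would store one profile per pipeline run and flag statistically
# significant changes between runs, instead of user-programmed thresholds.
```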

Kostas Pardalis 39:49
Aha, yeah. But okay, I understand that you're using these characteristics of the data as a proxy that something might be going wrong, right? But it doesn't mean that something necessarily is wrong. On the other side you have a model that is doing something, right? We are using this data for a reason; we have some kind of business objective tied to it. So let's say we go to the data scientist, raise a flag, and say: hey, suddenly I see more null values than previously, so that's an anomaly; or suddenly we see that the cardinality is changing dramatically. What does this mean for the data scientist? What can they do with this information? Because it might be a false positive or whatever, right? It doesn't necessarily mean that something will continue to be wrong with the model and how it performs as a service. So what's happening there? How is that part taken care of?

Ashwin Kamath 41:02
Yeah, so this is actually the part that I think is the most fascinating about the platform, which is that the system takes input from the data scientists to understand what's important to them. So imagine, basically, a data quality system that is learning what is important about each dataset it's monitoring, so that it can better track issues, find issues, and start to build resolution patterns for those issues as well. In fact, one of the big initiatives we're taking on right now is trying to understand, when an issue is resolved, what resolution was taken, so that the system can recommend a resolution the next time it occurs. So instead of you having to go into the data and delete outliers, the system itself says: click this button, and we will go delete the outliers for you. Ultimately, what it comes down to is building an AI system for the data engineers and data scientists of the world.

Kostas Pardalis 42:06
Yeah, that's super interesting. Okay, one last question from my side, and then I'll give the microphone back to Eric. So, okay, obviously you are very into data quality, and you have a lot of experience, both from building a product and from your previous work. If you had to give advice to someone who is assigned to start building, let's say, a new data platform, or to start investing in data infrastructure for a company, what would you say to them about how much attention they should pay to quality from day one? Or when should they start doing anything about it, if it shouldn't happen on day one?

Ashwin Kamath 43:05
I think it has to happen on day one, at least to start the process of understanding and thinking about what data quality means around that specific use case, that specific problem. Now, I think that over time data quality is going to become more and more of a solved problem: there's going to be better tooling available, and it's going to be easier and easier to set up a data quality stack from scratch. Today, operationalizing data quality is actually very difficult. Being able to continuously collect metrics about data as it's changing, and then have those metrics themselves monitored for anomalies and issues, takes a lot to get up and running. Oftentimes you'll see people buy it off the shelf, but data quality tools are quite expensive in and of themselves. So what we recommend is to figure out what specifically is very important, the kind of thing where, if this goes wrong, it's a deal breaker, it's just absolutely incorrect. This might be things like: if you have a column that represents the price of an item and it goes negative, that's clearly wrong, so maybe write a check for that. What I find a bit unfortunate is that there's a lot that can be learned about the data just by running it through these unsupervised systems that continuously observe and track how the data is changing over time, and I think that is going to become more and more democratized. So I would say to everyone out there: keep your hopes up, it's definitely coming down the line.
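
That kind of deal-breaker rule can literally be a few lines; a sketch assuming a "price" column:

```python
import pandas as pd

def check_no_negative_prices(df: pd.DataFrame) -> None:
    """Hard assertion for a rule that must never be violated."""
    bad = df[df["price"] < 0]
    if not bad.empty:
        raise ValueError(f"{len(bad)} rows have a negative price; refusing to publish.")
```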

Kostas Pardalis 44:59
That's great. Eric, it's all yours.

Eric Dodds 45:04
Yeah, okay. So I'm going to continue on that line of questioning, and we're close to the buzzer here, but I would love to know what your advice would be for our listeners who really resonate with what you're saying about data quality, and about some of the challenges with, say, your typical orchestration tools like Airflow and so on. The reality is, that's what they've got, right? And maybe they're not actually dealing with data that requires that level of quality or accuracy, or maybe it's just first-party data and they don't have a ton of third-party data. But they know that quality is really important. What advice would you give to them? I mean, you built this stuff from the ground up, and now you're building a company that solves it. What advice would you give to someone who really values data quality but has the tools they have and wants to implement this at their company? What should they do, and what are the next steps you would recommend for them?

Ashwin Kamath 46:21
Yeah, I think the biggest thing that I see people get bogged down by and confused about is the appropriate way to orchestrate data quality jobs, so to speak. At some companies it's put directly into the data processing pipeline, such that as soon as my processing is done, the data quality check happens immediately after, one series of steps. My biggest recommendation here is to really think about, from a rules-based perspective, what matters for data quality, and structure those checks as independent jobs that run on their own schedules, not tied directly back to the processing. The second step is making sure that if there are chained data processing steps, they take that data quality status into account. So let's say I have the five rules that matter for me to say my data is in a healthy state: run those in a separate Airflow process that basically asserts, true or false, is my data in a healthy state, and use the status of that to determine whether or not another pipeline is allowed to run if it uses that data asset as a source. This is how you slowly build a dynamic system that takes into account both the data processing and the data quality, and ties them together in a way that gives you that level of robustness. And this is actually exactly what we're trying to do with Spectre: basically build that dynamic, low-level DAG of interactions between the data processing system and the data quality system.
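
A minimal Airflow sketch of that pattern: the quality assertion runs as its own task, and downstream processing short-circuits (is skipped) unless it returns True. The DAG, task names, and the check body are hypothetical stand-ins, not a specific production setup.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator, ShortCircuitOperator

def data_is_healthy() -> bool:
    # Stand-in: in practice this would run your freshness / volume / null-rate rules
    # against the source table and return True only when all of them pass.
    return True

def build_report() -> None:
    print("building the report from data that passed the quality gate")

with DAG(
    dag_id="orders_report",
    start_date=datetime(2022, 11, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    quality_gate = ShortCircuitOperator(task_id="quality_gate", python_callable=data_is_healthy)
    report = PythonOperator(task_id="build_report", python_callable=build_report)

    quality_gate >> report  # the report only runs when the quality gate asserts True
```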

Eric Dodds 48:27
Yep. Yeah, that's fascinating, because, I mean, not that DAGs aren't capable of considering the things you just mentioned, but a lot of times it just deals with data completeness or data freshness, right, where a job runs. And then a lot of companies sort of manage all of the debt that's created along the way just with, you know, massive compute

Kostas Pardalis 48:50
on the warehouse, right.

Ashwin Kamath 48:53
And human data support, yeah. So

Eric Dodds 48:56
Yeah, yeah, for sure. This is like one of

Ashwin Kamath 49:00
the biggest things, right? So you've got Airflow, you've got your data quality system, which is in its own isolated place, reporting issues one after the other, and your data processing system has no idea, right? So your data processor just continues to process the data, and you get ten chains deep into creating a report, and then you realize, oh wait, the data that was initially ingested into the company was already in a bad state; none of this should have even run, right? Yeah. But it is very hard to set up that network topology in a way that guarantees data is only going to be processed if it's in a good, healthy state.

Eric Dodds 49:43
Yep. Yeah, for sure. No, that is so instructive. I'm even thinking about our own pipelines; you've inspired a lot of thinking there. Where can people go to learn more about Spectre and about you, if they want to dig into this and learn more about the concepts you're talking about?

Ashwin Kamath 50:04
Yeah, so I am most reachable on LinkedIn; you can find me there, my profile is Ashwin Kamath. Our website is a great resource for more information about the product. That's www.spectredata.com, and we have a contact form there where you can reach out to the rest of our team as well.

Eric Dodds 50:27
Awesome, very cool. And we will put those in the show notes as well. So Ashwin, thank you so much, this has been absolutely fascinating. I feel like we could go for another hour, but Brooks is telling us that we're at the buzzer. So, thank you so much for your time.

Ashwin Kamath 50:48
Oh, thank you all for having me. And this was a great, great show.

Eric Dodds 50:52
I think my biggest takeaway, Kostas, and maybe this is a weird way to say it, is that a lot of people think about Big Brother as being the government, and really, Big Brother is just hedge funds, and data about us copying and pasting, and then intelligence built on that saying you might be a risk.

Kostas Pardalis 51:14
It’s true.

Eric Dodds 51:18
That's true. But it is amazing, I mean, the things he brought up about web behavior, about foot traffic data, about credit card transactions, all this sort of stuff. I mean, it's a little bit scary in many ways.

Kostas Pardalis 51:35
It's anonymized, though.

Eric Dodds 51:36
Anonymized, you know, but it's wild. I mean, the stuff that he's done, and that level of data modeling and that level of granularity, is amazing. And I think, as he said, the actual infrastructure to drive that is incredible, right? I mean, the blunt way to say it is that the industries actually driving infrastructure innovation at significant scale are ones like finance. And I think we saw that with Ashwin.

Kostas Pardalis 52:17
Yeah, absolutely. And I found it super interesting that you can take a topic we have discussed a lot already, like data quality, for example, and see how much of a different perspective someone can bring because they're coming from a different industry. Even the terminology he was using was very different compared to what we have heard from other vendors that are building tools, right? That's what I find super, super interesting. We're so privileged to be doing this show, because we have the opportunity to compare these different, let's say, theses around how to build a product, while accounting for the biases each person has because of the industry where they had to solve that problem. And of course, to see at the end who is going to win, because that will tell us which industry has a much better, let's say, understanding of the problem. So yeah, super, super interesting.

Eric Dodds 53:30
Yeah. Now that we're talking about this, I regret not asking him if he had worked with Deephaven, because they work in the finance industry and do real-time data feeds. Well, we can follow up with him. Actually, we should get him and, I think it's Pete, is that right, Brooks? Pete from Deephaven. Brooks is giving me the thumbs up.

Kostas Pardalis 53:54
Off screen.

Eric Dodds 53:56
Great. Well, let's do that, let's follow up with him. Then we can do a finance data podcast. Maybe we could also get the person who used to be at Robinhood and is now at Stripe. That'd be cool. All right. Well, thanks for entertaining our banter for another episode, and we will catch you on the next one.

We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.