Episode 160:

Closing the Gap Between Dev Teams and Data Teams with Santona Tuli of Upsolver

October 18, 2023

This week on The Data Stack Show, Eric and Kostas chat with Santona Tuli, Head of Data at Upsolver, a data pipeline tool that focuses on application developers. During the episode, Santona discusses her background in nuclear physics. The conversation covers topics such as specialized tools in data infrastructure, the challenges of working with data in different environments, and the importance of data as a business differentiator. They also discuss the features and capabilities of Upsolver, the need for quality and observability in data processing, and more.

Notes:

Highlights from this week’s conversation include:

  • Santona’s journey from nuclear physics to data science (4:59)
  • The appeal of startups and wearing multiple hats (8:12)
  • The challenge of pseudoscience in the news (10:24)
  • Approaching data with creativity and rigor (13:22)
  • Challenges and differences in data workflows (14:39)
  • Schema evolution and quality problems (27:01)
  • Real-time data monitoring and anomaly detection (30:34)
  • The importance of data as a business differentiator (35:48)
  • The SQL job creation process (46:25)
  • Different options for creating Upsolver jobs (47:20)
  • Adding column-level expectations (50:17)
  • Discussing the differences of working with data as a scientist and in a startup (1:00:18)
  • Final thoughts and takeaways (1:04:01)

 

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 00:05
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com. Welcome back to The Data Stack Show. Kostas, boy do we love talking with people who have worked on really interesting things, like colliding particles that explode and teach us things about the way the universe works. And today, we’re going to talk with someone who has not only done that at CERN, but Santona Tuli has worked in multiple different roles in data — ML, NLP, all sorts of stuff — in multiple different types of startups, and multiple startups in the data tooling space, actually. So kind of a little bit of a meta play there, which is interesting. And she’s currently at Upsolver, a fascinating tool. And there are two things that I want to ask — one I have to ask about: nuclear physics. I mean, she’s a PhD, right? We have to ask her about that. But I’m also interested because Upsolver is really focused on — it’s a data pipeline tool, but they’re really focused on actual application developers. Usually, you would think of that as an ETL-flavored pipeline that’s managed by a data engineer, but they’re going after a different persona, which is really interesting. So those are two things that I want to ask about. How about you?

Kostas Pardalis 01:50
Oh, 100%, Eric. I think we definitely have to spend some time talking about physics and science, and about her journey in general, right? I mean, it’s very fascinating to see people that have the journey that she has, from, like, you know, very core science to data science to products and data platforms. So we’ll definitely do that. Now, I think what we are seeing here with Upsolver is very interesting. I think, like, trends when it comes to data infrastructure in general — we see that tools tend to start specializing more and more. And that’s, like, a result of, let’s say, both the scale and the complexity of the problems that we have to deal with today, right? So Upsolver is an ingestion tool, but it’s not like a generic, let’s say, ingestion tool. It’s something that’s, like, dealing specifically with production data, right? Things that are coming through CDC, for example, and streaming data in general. And they are also dealing with a very common problem that data infrastructure in particular has, which is that there are just too many different stakeholders that are part of the lifecycle of data. And you can’t just isolate the product experience to one of them, right? And that’s, like, a decision that we see here from a product perspective — it’s not like, oh, we have the people in the production database who are also responsible for this data and its generation, so we can keep them out of the loop. They take a different approach here, which I find very interesting. I think, regardless of, I mean, how successful this is going to be, it’s very indicative of the state of affairs today when it comes to building robust and scalable data platforms. Yeah.

Eric Dodds 03:51
All right. Well, let’s dig in and learn about nuclear physics, and see if you’re right about how to build a scalable platform. Santona Tuli, welcome to The Data Stack Show. We’re so excited to chat with you.

Santona Tuli 04:06
Hi, Eric. Yeah, excited to be here. Thanks.

Eric Dodds 04:09
All right. Well, give us your background, fascinating background that started in the world of nuclear physics of all things. So start from the beginning, and then tell us how you got into data.

Santona Tuli 04:22
Sure, happy to. I got my PhD in physics, studying nuclear physics, as you just mentioned — worked at CERN, colliding particles at very high energies, and then analyzing the aftermath of those collisions. And the goal here was to answer questions about, you know, fundamental physics. Why is the universe the way it is today? How did it all start and how did it evolve? So really interesting stuff, but you have to work with a massive amount of data, and it’s sort of like sifting through — there’s a piece, it’s sort of sensationalized, but it’s kind of good for reference, called a needle in a haystack or something like that. It gives you an idea of the order of magnitude of data and sort of how much you have to sift through the noise in order to get the signal out. So it was a lot of work — that’s another way of saying it was a lot of fun. Some data engineering aspects, some science and analysis aspects, and then presentation and, you know, writing papers and stuff, all of which, separately, I enjoyed very much. So Eric was sharing before we started recording, like, what did you want to be when you grew up — that reminded me, when I was very little, I wanted to be an author, and I wanted to be a scientist. And those two things do kind of come together in a lot of my work. And at this point, I’m curious about you, the audience — are

Eric Dodds 05:55
you? Yeah. So I do have to ask: how did you decide? I know you said, from a young age, you wanted to be a scientist. When did you know you wanted to get into physics, and then specifically nuclear physics? Like, what drew you to that specifically?

Santona Tuli 06:13
Yeah, so I think I was drawn to physics from early in my high school career, because I had a high school teacher who was just really expressive, and, like, demonstrated things — showing off physics. So, like, he’d have a hot cup of tea in his hand, and then he’d do the whole centrifugal motion thing, and it’s like, yeah, that’s, you know, it’s physics — this is why it works. So there are sort of the storytelling and, like, visual aspects of it that I think I was drawn to. And, I mean, it’s one of those things — which is very unfortunate — but it’s one of those things that people either kind of hate or kind of love, just like math. I think it’s a little bit like you’re conditioned to, you know, as soon as you hit a wall, think, oh, I hate this, and whatnot. I just really enjoyed solving physics problems. So, I mean, you could say that I loved it. But on the other hand, it’s not that I didn’t find it challenging — I just really enjoyed doing it.

Eric Dodds 07:16
Yeah, very cool. So after working as a scientist — and actually, I mean, amazingly, you fulfilled both of your childhood dreams, probably, since you, I’m sure, authored a bunch of papers as a scientist — once you fulfilled your childhood dreams, you went to work for startups. So tell us, how did you do it? What drew you to making that transition? Why did you choose startups?

Santona Tuli 07:47
Yeah, that’s a great question. I was thinking about this the other day — so I’m on my third startup now, and every time I accepted an offer with a startup, I had a competing offer from a larger company. And somehow, for different reasons each time, I think — but maybe, you know, subconsciously, for the same reasons — it was always the startup. So there must be something there. I think the first time around, I wanted to work in NLP. So the first startup I went to was in the NLP sector, and as an ML engineer, versus a data scientist. So I think that’s what drew me. But once I was in, I was hooked, I would say. I think since then, it’s just the fast pace. You know, getting to learn a lot, but also kind of being forced to do a lot of different things — you know, wear a lot of different hats and just fill in wherever the gaps are. I really enjoyed that. I’m not the kind of person that is super content with, you know, just having a spec, and then you go and you do it, and that’s all you know — everything that’s within that box. I like higher levels; I like seeing how my work touches other people and how they’re interacting and stuff. So yeah, I went to work as an ML engineer. And then from there, I went to work as a data scientist at Astronomer, which was the first tooling company — so I’m at my second data tooling company now. And there, my role was at the intersection of data and product. I mean, I was a data scientist, but I ended up doing a lot of things like product work, interfacing with the rest of the company on what they needed from the data team, making those cross-functional relationships, and then dogfooding the product and feeding that back into the product. So I really enjoyed that — all of these different dimensions were coming together. And at Upsolver, I bring all of those things together.
So I do internal analytics work — work in data — but I also do product strategy and, you know, a little bit of product marketing: thinking about what we’re building, who we’re building it for, how to make it better for that target audience, and then how to phrase it such that they see the value in what we’re building.

Eric Dodds 10:09
Love it. Okay, I do actually have a question for you. I want to dig into your work at startups and with data, but having done science at such a high level, is it hard for you to see a bunch of pseudoscience in the news? I mean, you of all people probably have the ability to discern, you know — especially thinking about things like statistics around science. I’m not an expert, but the news media, you know, can be pretty — they like to create headlines, right? And so when there are scientific things, especially related to statistics, I know a lot of times they can run a little bit fast and loose. Do you see that all the time? Like, you probably can’t help it, I would guess?

Santona Tuli 11:00
Yeah, I do. But I mean, there are two sides to this, right? On one hand, I’m really glad that the news is coming out, because one of the things that we struggle with in academia is getting funding, for instance, for doing the research that we know is so important to do. But we have to convince governments and other institutes to fund that. So our work getting in the news — academics’ work getting in the news — is actually really good. So in that sense, I’m happy. But on the other hand, like, the most recent one was with the room-temperature superconductor, right? There was this paper, and all of a sudden, everyone’s talking about it. And folks who don’t have a strong sense of what the results mean, or what it would mean, what you would need to get there, are talking about it. So again, positive awareness is great, but the negative is: okay, are we overpromising? Are we misinterpreting the results and thinking that we are somewhere where we’re not yet? And I mean, being outside that domain — like, I was in nuclear physics, this isn’t superconductor physics, right? I don’t have a super great understanding of everything. But yeah, as a scientist, and as a physicist, I definitely come in with that skepticism. Okay, let’s look at this paper. Let’s look at that plot. Let’s look at what error bars they’re quoting, and, you know, what significance they’re claiming to have. Because we were so pedantic — I mean, in a good way — at CERN, and in particle physics in general, the statistics were so important: getting not just the number, but the error bars on it, right? And, you know, seeing how different it was from, like, the null hypothesis and stuff. So yeah, these are things I think, once they’re sort of drilled into you, you never let go of.

Eric Dodds 12:48
Yeah. Well, thanks. That’s actually super interesting. Okay, so let’s actually tie together your work as a scientist with your work in data. One thing that’s really interesting to me — and let’s maybe use CERN as an example, and I’m speaking way out of my depth here — but, you know, as an outsider, when I think about your work there, it seems that there are, obviously, multiple components, but one of them is highly exploratory, right? Like, you’re trying to answer really big questions. There’s an element of creative thinking that goes into that discovery. And then there’s also this extremely high level of rigor, right? Where you have to get the error bars right, because you’re holding yourself to a very high standard. And that means, like, process and operations and, you know, all that sort of stuff. Do you approach data in the same way? I mean, data has creative, exploratory, discovery-focused elements to it, but it requires a huge level of rigor. Like, what are the similarities, and even differences, in the way that you approach working with data, or things that you learned as a scientist that you brought with you?

Santona Tuli 14:04
The short answer is yes — I try to approach my data work today the same way that I would approach it when I was working with particle collision data. However, there are clearly differences, right? I think one of the main differences, as far as functional, day-to-day work goes, is the deadlines are a lot shorter, right? And that comes with the level of rigor and detail in particle physics, or any other kind of large-data physics — everything gets checked over. And I think there are some inefficiencies in that as well. It’s not just like, okay, you’re checking it over, and that’s good. I think that we — at least within the confines of my group, the group that I worked with, which was a 60-person group in a much larger 5,000-person collaboration — it’s not as process-oriented as things sometimes are in industry. So it’s less clear who’s blocked by whom, it’s less clear what the next steps are, it’s less clear what the best way is to provide feedback or, you know, do a PR review. Those are things, looking back now, where I think: okay, these were lacking. Like, I could go in today and make a bunch of process improvements to, you know, the workflow there at CERN or Davis, and that would help move things along a little bit faster. But, I mean, with that, let me say: it takes time to do an analysis on such big data and, you know, going into so much depth. But I guess on the flip side, what I miss is people caring about error bars, right? In industry, you get the result, and then you sort of move on. It’s not very common to actually think about, you know, what the systematic uncertainty is — even if you do think about statistical uncertainty, you usually don’t think about, okay, what biases have I introduced in doing this analysis? So I do miss that. So I just entertain myself, you know, reading academic papers and stuff like that. So, you know, it’s not all bad.
At the end of the day, everything is impactful, but the stakes are different. You know, if you’re selling an item, it’s not as impactful, in some ways, if you get it a little bit wrong, compared to, like, making some claims about having discovered, you know, a new particle. But yeah, I mean, I’m sure there are folks who would argue just the other way around, right? That that’s pie in the sky.

Eric Dodds 16:51
Yeah — you accidentally recommend the wrong product to someone, versus making a fundamental mistake about the basic functionality of the universe. Well, tell us, Santona, you’ve had a journey at multiple startups, and you’re at Upsolver now. Tell us what Upsolver does.

Santona Tuli 17:14
Yeah. At Upsolver, we’re building a data export and load tool for developers — for application developers — that helps get data produced in operational databases, and just data that’s generated when you have an application in production and folks are interacting with it. Some of it is, like, what are users doing in there; some of it is deeper transactional data. We get that data into wherever it needs to go for other use cases — so downstream use cases might be analytics, ML, whatever it might be, whether it needs to land in a warehouse or a data lake. We’re focused on getting the data there at scale — at the same scale that the production databases are actually producing the data, so we’re not holding stuff back — and with high quality. So as a developer, you know, you’re used to being able to look at your data, test your code, and all of these things that we sort of take for granted in engineering tooling — like, for example, being on call and getting alerted when something goes wrong. We’re bringing those same sorts of engineering practices into data tools. And we’re really thinking of application developers as the folks who would feel most natural in our tool, I think — but, I mean, anyone who’s doing data ingestion into a data warehouse, this applies to them as well. We’re basically replacing a bunch of do-it-yourself stacks for this complex, high-volume data from operational databases and streams and such.

Eric Dodds 18:58
Yeah, I want to dig in on the developer focus, because that’s interesting. When you describe the product, I’m thinking data engineers, and they’re building, to your point, a pipeline that is ingesting some sort of logs, you know, application data, etc. And they’re building your sort of classic ETL pipeline, or, you know, even streaming, depending on the use case. So my mind goes squarely to the data engineer who’s going to be building and managing a pipeline. But — and of course, it sounds like that person can use it — you said developer, like an application developer. Can you dig into that for us? Because that isn’t who you would think of, you know, as the target persona for what you’re describing, sort of an ETL flow that would typically be managed by a data engineer.

Santona Tuli 19:52
Yeah, absolutely. I think it’s really interesting. We were having this — so Roy Hasson here, you might know him as well, he brings this up as well — we were having this discussion about, like, who is our product for? And we decided we just want to meet teams where they’re at. So what do we mean by that? From my experience at a previous team, being on the data side, we would get the CDC data — so the Change Data Capture captured from operational databases — dropped off in a storage bucket that we would then have to pick it up from. So there was no expectation, and we weren’t allowed to go all the way to the source and get the data from the database. So there was a separation. I was like, okay, maybe that’s not everyone, but there are teams where that is happening, and we want to make a tool that sort of bridges that gap. So that, you know, if you’re a developer on the application side, you can send the data not just to an S3 bucket, right, but all the way through to Snowflake — you can write that ingestion easily and directly. And this is a tool that, you know, you can also give your data engineers access to, and they could be writing those pipelines as well. So it’s just, like, gluing together, almost — or filling that gap, bridging that gap that exists today between the dev team and the data team, because of the way that, you know, we’ve been doing things for a little while. So yeah, again, anyone can use it, but we want to meet whoever that person is, right, that’s responsible for it today. And one of the things that we also notice is, when we’re building data tooling, we usually build for data personas. And there’s at least this idea — and I think, to some extent, a fair idea — that some of the engineering rigor isn’t there, that it doesn’t have to be there for these tools.
Maybe partly because there’s a lot of batch processing going on, right? So you can wait — your SLAs aren’t as, like, you know, do-or-die, right? If it’s a dashboard, then it can be a dashboard that updates, you know, let’s say every six hours, not on the minute or something. And that’s fine. For smaller-scale data, or, like, business data, that makes sense — like, your customer success person maybe does not need to constantly watch a customer, right? But if it is your product data — right, things that your users are doing within your product, the ways that your microservices are talking to each other, communicating through message buses — then sometimes you want to make decisions, not in absolute real time, but within, you know, some short timeframe. You want to make decisions about your product based on that data. That’s what we want to enable: do it fast, near real time, do it at scale, and do it with certain quality and observability measures, so that you’re not making any sacrifices

Eric Dodds 23:04
Because you’re working with data. Yep. Yeah, that makes total sense. And can you just walk us through — so you said that, you know, as the data team, you’re gonna get a dump from the production database into an S3 bucket. And, let’s say, the application developers are just sort of throwing that over the wall, right? It’s like, we need data from the production database, and they’re gonna be like, okay, great, we’re gonna replicate it, or CDC it, or however they get it up there. And here’s your bucket, right? And of course that creates issues, because it’s like, well, you know, we need to change the schema, or there’s a bunch of issues with the data. So that creates a lot of work for the data team. Is that typically the flow? Like, is the data team asking for the dump, and the application developers just sort of figure out whatever their preferred way to get it in the bucket is? Is that usually the typical flow?

Santona Tuli 24:05
I’ve definitely seen it that way, and especially at startups, right? Like, when you’re one of the first — maybe the first person, or one of the first few people — that’s starting to think about data and making data-based decisions at a startup, you kind of have to — I’ve had to do this — you kind of have to figure out what all the data is and where it all lives. And none of it is, you know, brought in yet; there is no warehouse yet. So I’ve definitely done that myself, and I’ve seen and know of others who, as a data scientist, have had to go to the app folks and be like, hey, I need to analyze this; this is important for my work. But also, you know, it’s also true that app developers care about their data, because everyone wants to understand what they’re building and the effect that it has on other things, right? Sometimes as app developers — as production engineers — I think we’re kind of in the nitty-gritty of our backlogs, and we’re moving on to the next thing for the next sprint, right? It’s like someone else is making the product decisions, and the sprints keep coming — today I’m working on something, and tomorrow, you know, it’s going to be totally different. But from my perspective as a production engineer, I really want to know how my product is doing today and what it’s doing today — what is it lacking? So yeah, I’ve seen kind of both directions. And just to, you know, round out that answer: I think CDC is definitely not new, nor is database replication, right? It’s also useful for various needs other than analytics. But, you know, usually you’re coming from two different directions towards the same data, and you have different use cases, or different stories, in mind.
We want to facilitate coming at it together and building something from the get-go that’s going to sustain and that’s going to scale.
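The CDC hand-off described here — change events flowing from an operational database toward an analytics store, rather than raw dumps dropped in a bucket — can be sketched generically. This is a minimal, hypothetical illustration of applying insert/update/delete events to a replica; the event shape and function names are assumptions for illustration, not Upsolver's implementation:

```python
# Illustrative sketch (generic CDC event shape, not Upsolver's implementation):
# apply a stream of change events to keep a warehouse-side replica in sync
# with an operational table.

def apply_cdc(replica, events):
    """Apply CDC events, keyed by primary key, to an in-memory replica."""
    for ev in events:
        if ev["op"] in ("insert", "update"):
            replica[ev["key"]] = ev["row"]   # upsert the latest row image
        elif ev["op"] == "delete":
            replica.pop(ev["key"], None)     # tombstone: drop the row
    return replica

# A short change stream for one row: created, updated, then deleted.
events = [
    {"op": "insert", "key": 1, "row": {"id": 1, "status": "new"}},
    {"op": "update", "key": 1, "row": {"id": 1, "status": "active"}},
    {"op": "delete", "key": 1},
]

replica = apply_cdc({}, events)  # ends empty: the only row was deleted
```

Real CDC tools (Debezium, native database replication) carry richer envelopes — before/after images, transaction metadata, ordering keys — but the upsert/tombstone core is the same.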

Eric Dodds 26:08
Yep, that makes total sense. All right, one last question for me — and I’m very excited to hear what Kostas is gonna ask you — but you talked about maintaining a certain threshold of quality. And I understand, and I think a lot of people understand, that if you just get data — you know, a dump of a database, right, or a bunch of logs or whatever — it’s like, okay, we have to have jobs that run cleanup, and all that sort of stuff. So it makes sense intuitively that your product would help facilitate that. But can you talk about some of the specific quality problems that relate to application data? Like, what are the specific flavors of quality problems that you generally run into with application data?

Santona Tuli 27:01
Yeah, one of the most obvious ones — and I think you were sort of hinting at this earlier — is schema evolution, right? My payloads are going to change as my services talk to each other, or whatever. So when we say prod data at Upsolver, we’re defining it pretty widely. We are talking about database replication — so we support various source databases — but we’re also talking about consuming from message buses, right, message queues, because that’s also part of how, you know, applications are operating and interacting with each other. So if I’m building a product for end users, and I have, you know, thousands, or hundreds of thousands, of end users using my product every day, then I’m going to want to make changes. I’m going to want to improve that product and move fast, and on to the next thing — again, going back to that constant backlog and sprint cycle. So I don’t have as much time to, you know, promise a certain schema, and then make sure I adhere to it, and make sure I deliver it that way. So that’s maybe just one of the reasons that schemas evolve. But the bottom line is that schemas evolve, and on the receiving end of things, you don’t want that to break your analytics pipeline; you don’t want it to break your dashboard. And, you know, the other fact of the matter is that if I’m a data engineer, and I own, you know, maybe six or seven different ETL pipelines, right, I’m not watching the data constantly. And we believe that there aren’t really great tools out there that are watching the data proactively — not just after it’s landed in a warehouse or something. So oftentimes, when there is a breakage of some sort, or some dashboard is showing incorrect numbers or something, that is caught by consumers. Now, fortunately, the data consumer is usually internal.
So it’s not, like, the worst thing in the world — unless you’re doing some ML that’s end-user serving. But, you know, that sort of experience, right? Like, your site reliability engineering partner comes to you and says, hey, this is all messed up, what’s going on? And then you have to look back and do this, like, mystery solving to figure out what’s going on. So that’s the kind of disconnect that we talked about — and, you know, I know that’s part of the discourse right now; a lot of folks are talking about this divide between dev and data. So I think schema evolution is one that we’ve all really felt, and so being able to automatically adjust to that. What we do is, if your schema changes — whether it’s CDC or streaming, and this is actually important for CDC, because, you know, in the case of Change Data Capture, or database replication, you might have an entire table that’s added, right? You might be consuming from, like, 50 different operational databases, and, you know, something major changes. So being able to adapt to that in real time without bothering you, and without breaking anything. Yeah — there’s a new table? We just created a new table in your Snowflake, or whatever it might be.
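The automatic schema adaptation described here — new fields appearing without breaking downstream consumers — can be sketched in a simplified form. A minimal sketch, assuming an in-memory schema map with coarse type inference; the field names and helper are hypothetical illustrations, not Upsolver's behavior:

```python
# Illustrative sketch (hypothetical helper, not Upsolver's implementation):
# evolve a target table's schema as events arrive with previously unseen
# fields, instead of failing the pipeline.

def evolve_schema(target_schema, event):
    """Add any new fields from the event to the target schema.

    Returns the list of columns that were added, so a pipeline could
    log or alert on them rather than breaking downstream consumers.
    """
    added = []
    for field, value in event.items():
        if field not in target_schema:
            # Infer a coarse type from the first value we observe.
            target_schema[field] = type(value).__name__
            added.append(field)
    return added

# A target table that originally had two columns...
schema = {"id": "int", "email": "str"}

# ...receives a CDC event carrying a new "plan" column.
new_cols = evolve_schema(schema, {"id": 42, "email": "a@b.co", "plan": "pro"})
```

A production system would additionally handle type widening (an int column starting to carry floats), whole new tables, and backfilling the new column for existing rows.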

Eric Dodds 30:20
So schema evolution is a really big one. That’s super helpful. Yeah, new data sources always have a huge downstream impact — like, such a painful thing to deal with.

Santona Tuli 30:35
Yeah, exactly. And then our observability tooling — it lets anyone that’s involved in the space, from dev to data, watch the data as it’s flowing through. So you have real-time data volume tracking — like, sometimes the volume goes up: what’s going on? Sometimes the volume goes weirdly down: maybe there’s an outage. So, you know, being able to investigate that, having that live in front of you. There are other ways in which you can spot anomalies: we always let you know what the top values are at any given time within a timeframe, and how that’s changed from before. You know, last-seen, first-seen information — the kinds of things that are sometimes in information schemas that are hard to get to — and some additional stuff; we just put everything up front. And then lastly — well, there are a lot of features that I can talk about, but the other thing I wanted to mention is you can set quality expectations in your data movement pipeline, on specific fields, or values for specific fields. So you can quarantine bad data, or just tag it and get a warning, and so on. For me, those are the quality aspects. And then there’s a slightly more technical one, which I will mention, which is, because we handle streaming data — consume from streaming sources — we do exactly-once processing and have a strong ordering of data, which is also really helpful if you’re working with streaming data.
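The column-level expectations mentioned here — quarantining or tagging rows that violate per-field rules instead of letting them break the pipeline — can be sketched as follows. This is a hedged illustration with hypothetical helper and field names, not Upsolver's actual expectations syntax:

```python
# Illustrative sketch (hypothetical helper, not Upsolver's API): apply
# column-level expectations to a batch of rows, routing failures to a
# quarantine list along with the names of the columns that failed.

def apply_expectations(rows, expectations):
    """Split rows into (passed, quarantined) based on per-column checks."""
    passed, quarantined = [], []
    for row in rows:
        failures = [col for col, check in expectations.items()
                    if not check(row.get(col))]
        if failures:
            quarantined.append((row, failures))  # keep row + reason for triage
        else:
            passed.append(row)
    return passed, quarantined

rows = [
    {"user_id": 1, "latency_ms": 12},
    {"user_id": None, "latency_ms": 9},   # violates the not-null expectation
    {"user_id": 3, "latency_ms": -5},     # violates the non-negative expectation
]

expectations = {
    "user_id": lambda v: v is not None,
    "latency_ms": lambda v: v is not None and v >= 0,
}

good, bad = apply_expectations(rows, expectations)
```

The design choice worth noting is that failing rows are preserved with their failure reasons rather than dropped, so a pipeline can alert, backfill, or replay them later.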

Eric Dodds 32:10
Yeah. Super helpful. All right, Kostas, you’re up.

Kostas Pardalis 32:15
Thank you so much, Eric. So Santona, you talked about a lot of very interesting things, and I’ll probably touch on at least a few of them again. But what I would like to ask you to do is put your product hat on, right? And let’s help our audience a little bit to understand the use cases we are talking about — all this data, streaming data, processing, CDC, all these things. But before we get into the technical stuff: why do we do that, at the end of the day? What are the most common use cases that you see out there? Why do people, for example, care about consuming a CDC stream? You mentioned that with Upsolver you can take the data from a Postgres CDC feed and push it directly into Snowflake, right — not dump it on S3 or something like that and then prepare it to load into Snowflake. So why do we do that? What are we going to do with this data in Snowflake? And what’s the difference between the data — are they identical? Do we just replicate what’s happening on Postgres onto Snowflake, or do you see something else happening there?

Santona Tuli 33:48
Right, yeah, that’s an excellent question. I’ll try to put on my product hat, but I’ll actually start by saying, as a data scientist, I want to solve problems for the business, right? Again, thinking about the higher level and the big picture — especially when you’ve been doing it for a while, you learn to start to think about, okay, what are the questions that we need to answer in order to make good decisions for my business? And at some point you go past the, you know, 20 or 60 or so questions that you’re going to answer at every company you work at — when do I call my customer healthy? When is the customer likely to churn? When are my support ticket spikes, and stuff. Once you move past those things, there are going to be questions about your product itself — not just clickstream data, not just user behavior data, although that data is also extremely important, but more in depth: what is my product? What is it doing? What is its peak usage? When is it faltering? What are the times when my user comes to my website and has to wait an extra millisecond for something to load? As you get to those types of questions, that is when prod data becomes really important. And that’s one side, the analytics point of view. The other is if your product is literally based on data that your users are generating live. So one of the big use cases we see is ad tech, where you have to do ad attribution based on folks actually being online and what they’re doing. So again, the data is being produced at high scale, and it has to be near real time. So that’s one thing we see. From the analytics point of view, the way I see it is that prod data is your moat data.
So we talk about business moats, right — the business moat is what differentiates the business from others in the relevant space. And I think of prod data as your moat data for two reasons. One, it’s data that you uniquely have, because you’re generating it — it’s literally your product. It’s something that no one else can have, so in that sense, it’s a moat. But the other aspect is that you have to unlock it — you actually have to use it, get it into your warehouse, and do the analytics, and then it becomes a true differentiator for you. So, yeah, for me that’s why prod data, or operational data, is important from an analytics point of view. And then I talked about the use case of ad tech. Another set of users we have is somewhat larger — in the healthcare service industry, for example, or wherever you have multiple kinds of interactions that the user is having with your product. So for example, if I’m a managed healthcare provider, then there’s the provider-doctor component, there are the individuals who are utilizing the service, there are the insurance components — all of these things are usually kind of well separated, but you have to consume data from all of them and then consolidate and do analytics on that. Maybe it’s not as real time as ad tech needs to be, but still — you don’t want a big mismatch between when someone went to see a doctor and, you know, when they’re going to get surgery, right? So having all of that data come through — that’s another big use case that we see.

Kostas Pardalis 37:43
Okay, that’s super interesting. And — okay, let’s move on to something else that you talked about with Eric: schema evolution, right? Obviously, things evolve, especially when we are talking about products. And I think you put it very well — there’s no way that the database you have won’t change, for many reasons: performance, the product itself adding features, debugging — there are many different reasons, right? So the schema itself will change at the source, many times. And it might change in a way that can be tricky — very subtle changes. But we’re talking about machines here. For the human brain, zero and one, and true and false, might semantically be equivalent, but that doesn’t mean it’s also true for the machine, right? A developer might change it, and things tend to break there. So in a real-time environment — and when I say real time, I mean a streaming environment, where you have an unbounded source of data, you don’t know exactly how the data will keep getting generated — you have to react fast, I guess, right? How do you deal with that when you have so many downstream dependencies? Because one column type changes at the prod database, and you might have hundreds of pipelines that one way or another depend on that. So how do you deal with that — both from what you’ve seen as a vendor that is trying to give solutions, but also from your users, what you’ve seen out there?

Santona Tuli 39:46
Yeah, yeah. It’s a hard one — or maybe I should say it’s a painful one, right? It’s something that’s hard not to experience if you are building these pipelines, and then use cases on top of them, because as you said, once the source data gets to your warehouse or lake, then all of a sudden it’s being modeled, and it’s going into this pipeline and that pipeline. So if you don’t catch it at the very beginning, it really is bad news bears later on. And that’s kind of why we’re building what we’re building. As a practitioner, having faced the pain and felt it — the only real solution is, you know, having a full picture at all times of where your data is going and what deliverables it’s feeding, so, like, lineage. And also being able to propagate appropriately — you make a change somewhere, and you make sure it actually flows through to the right places at the right time, while minimizing the amount of re-computation, because you also don’t want to go and replay a hundred different pipelines. So as a practitioner, it’s a lot of things to keep in mind, but that’s sort of the approach: just have a very good sense of, and visibility into, your data pipelines and the relations between them. And then as a vendor, specifically in the ingestion space, that’s a pain we’re looking to minimize. So we do a bunch of type resolution — like you said, something like a column type suddenly changes, how do I deal with that? What we do in the short term is make a copy of that column, add a suffix saying that this is now this type, and then join it. So there are things that we do automatically, so that as much as possible it prevents breaking a bunch of pipelines downstream.
And just having that visibility, I think, is huge. As soon as that happens, you can know, within Upsolver, that okay, this is weird, this isn’t supposed to happen, this is what it used to be, and here’s the timestamp when something changed. And sometimes things kind of fix themselves, right? Especially for prod data: okay, something changes, and then there’s a rollback or something. So having those timestamps — this is when this thing changed, and this is when it changed back — you can go back and decide what you want to do with the data in the middle. Maybe it’s irrelevant in the grand scheme of things and you just drop it, or something like that. So for me, that’s really the value prop. And, you know, not even speaking for Upsolver — for me as a practitioner and a user of Upsolver, the value prop is just being able to watch the data. Yep. Yep.
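The type-change handling she describes — routing a field whose type changed into a type-suffixed sibling column instead of breaking downstream readers — might be sketched like this. The function name, the suffix format, and the routing logic are all hypothetical; this is only an illustration of the idea, not Upsolver’s actual implementation.

```python
def route_field(record, field, schema):
    """Route a field's value into a type-suffixed column when its type
    no longer matches the type first seen for that field (illustrative)."""
    value = record[field]
    # Remember the first type we saw for this field.
    seen_type = schema.setdefault(field, type(value).__name__)
    if type(value).__name__ == seen_type:
        return {field: value}
    # Type changed mid-stream: keep the original column intact (null here)
    # and land the new value in a suffixed sibling column.
    return {field: None, f"{field}_{type(value).__name__}": value}

schema = {}
print(route_field({"zip": 94110}, "zip", schema))   # {'zip': 94110}
print(route_field({"zip": "94110"}, "zip", schema)) # {'zip': None, 'zip_str': '94110'}
```

Downstream queries that expect the original integer column keep working, while the new string values are still captured and can be joined or coalesced back later, as she describes.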

Kostas Pardalis 42:48
That’s awesome. And that brings me to the next question, which has to do with quality. You mentioned that the user is able to set expectations about the data, right, and that’s the way you put some quality checks in there. Can you elaborate a little bit more on that? I have two things that I’m very curious to hear about. First of all, what is the best way for a user to go and define these expectations? Because there are many different users that interact with the data, and not all of them prefer the same kind of APIs, right? Some are more engineering, some might be more like a data scientist or an analyst. So one thing is, what are the trade-offs there — finding the right way for people to define these expectations. And the other thing I would like to hear from you is: what exactly is an expectation? What are the common expectations that you see out there? Because, okay, technically it can be anything, right? You can ask any question about the data and set it as an expectation. But I’m sure there are patterns — standard things people are looking for, or things they like to avoid, or some that might be computationally too expensive to accept as an expectation. So tell us a little bit more about that part.

Santona Tuli 44:16
Yeah, absolutely. So quality expectations are a fairly new feature that we rolled out, I think, about a month, maybe a month and a half ago. So it’s new and it’s fresh, and I might miss a few things. But let’s talk for a second about the user experience in the product, because that was your first question. You can author Upsolver ingestion pipelines a few different ways, exactly for the reason you said — we want to cater to different kinds of users, right? So we have a no-code version. Let’s say you have a Kafka queue that is your source, and your target is Snowflake. You can configure the source and the target in a no-code, GUI-based wizard — internally we call it the ingestion wizard. So you do the source connection strings and the target connection strings, and then it immediately gives you a preview of the data. So if it’s a Kafka queue, you’re going to see, you know, 10, 20, however many examples, and it asks: what do you want to do? How do you want to pre-process it? We’re going to do the automatic stuff, like exactly-once and strong ordering, but how else do you want to pre-process it? So you can go in there and look at the sample. You can click into, let’s say, customer address, and there’s a nested field in there — this is a bad example, but, you know, street address, city, and then country or something. Then you can say, okay, I want to redact the street address; for landing in my warehouse, I only care about the city and the country. So you can do that in the UI, inside the GUI, as you’re setting up this job.
So masking and redacting is a big one. You can exclude columns entirely. And you might discover that, okay, there are two columns that are actually the same thing — maybe it’s like a phone_number and a phone, right, and one is filled 80% of the time and the other 20% of the time, or something like that. So you can coalesce those columns, again within the UI. So there are these things where the data pops up immediately, you look through it, and it lets you configure those things. And then at the end of that, when you say launch job, it’s going to start the job. But before that, it actually shows you the SQL that we generated — it’s Upsolver SQL, and that’s actually going to be the job. So if you are someone who is comfortable in SQL, at this point you can add to that — okay, additionally I want to do other customizations, and so on and so forth. And that’s the second kind of user experience: instead of using the wizard, you can just create an Upsolver worksheet, write a bunch of SQL, and build a job off of that. Now, because it is SQL — it’s not hidden from you, it’s surfaced to you — you can of course do your code version control and CI/CD off of that, just like code. And then there are other ways you can create Upsolver jobs: we have a dbt integration, so you can write dbt models that get executed over Upsolver. We have a Python SDK, so if you’re writing Python scripts for your workloads, you can use that. And we have an Upsolver CLI. So depending on what you’re used to and how you’re used to doing your work, there are a few different options, and in every case we try to make as much available across the board as possible.
You can imagine the GUI is the trickiest place to include all the different quality checks and stuff, but I think we’re doing a pretty good job of that. The second question is: what are expectations, and how do we define them? Basically, it’s the way you would do expectations in SQL statements. When you’re doing the copy into Snowflake, for example, you say you’re selecting these columns, and then you add an exception clause — write this, except when, something like that syntax. And then you say, okay — let’s say a state column is more than two letters; you know, states are all given as two letters. So you can say, if this happens — just like you would write in SQL. If I had, for example, a string column, I would write the same kind of thing: if this doesn’t match the regex pattern I’m expecting, then I don’t want it. The difference here is that you can say what to do in the case of a failing expectation. So you can say drop, or warn, or something else. And that sort of helps make the process go faster — you’re not making decisions about the data flowing in unnecessarily early, and you have that information later on to adjust accordingly.
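The row-level expectation pattern she describes — a predicate per field plus a drop-or-warn action on failure — could look roughly like this in Python. The column names, the regex, and the `apply_expectations` helper are illustrative inventions; Upsolver’s actual SQL syntax isn’t reproduced here.

```python
import re

# Hypothetical expectations: (column, predicate, action-on-failure).
EXPECTATIONS = [
    # Her example: US states are always two letters.
    ("state", lambda v: isinstance(v, str) and len(v) == 2, "drop"),
    # Her regex example, shown here as an email pattern check.
    ("email", lambda v: re.fullmatch(r"[^@]+@[^@]+\.[^@]+", v or "") is not None, "warn"),
]

def apply_expectations(row, expectations):
    """Return (row_or_None, warnings). 'drop' discards the row;
    'warn' keeps it but records the violation."""
    warnings = []
    for column, predicate, action in expectations:
        if not predicate(row.get(column)):
            if action == "drop":
                return None, warnings
            warnings.append(f"{column} failed expectation")
    return row, warnings

row, warns = apply_expectations({"state": "CA", "email": "bad"}, EXPECTATIONS)
print(row, warns)  # {'state': 'CA', 'email': 'bad'} ['email failed expectation']
row, _ = apply_expectations({"state": "California", "email": "a@b.co"}, EXPECTATIONS)
print(row)  # None
```

The key point she makes is the per-expectation action: a failing check does not have to mean dropping the row — tagging and warning keeps the data flowing while still surfacing the problem.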

Kostas Pardalis 49:19
Okay, that’s super interesting. And are expectations usually targeting, let’s say, row-level data, or a column, or a table? What’s the granularity people commonly care about? Because, okay, you gave the example of the regular expression. So let’s say we are expecting credit card numbers, and we want to make sure that they follow some pattern — this is on the row level, right? But do you also see people doing more, how to say it, holistic kinds of expectations? Like, the distribution of this column should be between this and that, something like that? What are the most common things that you see out there?

Santona Tuli 50:17
Yeah, that’s a great question. So we’re adding that — that’s exactly what I’m working on a PRD for right now: for numeric fields, what sort of aggregate things we’re going to calculate on the fly and present. Again, as I said, in the observability page, all of these things are sort of there. So I want to surface, for example, quantiles — relevant quantiles, and max and min and stuff. We do a little bit of that at the column level — we do have column-level properties in the observability tools. Usually it’s last seen, first seen, the top values, null density — things that are really useful. And then you can query that table and put conditions on it. So take my phone number column — let’s go back to that. If the null density suddenly increases to, like, 5% or above — I’m getting nulls — then do something: let me know, or alert me, because people are not filling in their phone numbers. So there’s a bunch of things already surfaced to you at the column level that you can use to create custom alerts and stuff. But I want to do more — I want to put on my product hat, right? Because you asked this question, and I’m sure other data experts are going to think the same thing: okay, I don’t just want row-level expectations, I want column-level expectations, and I care about this and that. So these are all things that we’ll be adding as well.
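A minimal sketch of the null-density alert she mentions, assuming a simple in-memory column sample. The 5% threshold and the phone-number column come from her example; the function names and the alert format are invented for illustration.

```python
def null_density(values):
    """Fraction of null (None) entries in a column sample."""
    return sum(v is None for v in values) / len(values)

def check_column(name, values, threshold=0.05):
    """Return an alert message when a column's null density
    crosses the threshold; None otherwise."""
    density = null_density(values)
    if density >= threshold:
        return f"ALERT: {name} null density {density:.0%} >= {threshold:.0%}"
    return None

# 2 nulls out of 20 values = 10% null density, above the 5% threshold.
phones = ["555-0101", None, "555-0102", "555-0103", None] + ["x"] * 15
print(check_column("phone_number", phones))  # ALERT: phone_number null density 10% >= 5%
```

In practice, as she notes, this kind of check would run over the stats a platform already maintains per column, rather than re-scanning raw values.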

Kostas Pardalis 51:57
Okay, that’s awesome. All right, one last question from me, and then I’ll give the microphone back to Eric, as we are approaching the end of the show here. So you have a very interesting journey. You started by doing some of the most detailed and precise work someone can do out there — working to reveal, let’s say, how nature works at the smallest possible granularity that we can reach as humans. You did data science in the industry. And now you’re doing product, right? So it might not be completely accurate, what I’m going to say, but let’s say you go from something very specific to reaching the point where you mainly have to deal with people, as a product person. So it’s not just the technology — it’s also the people, and people, unfortunately, are very hard to predict, right, and to understand and communicate with. So from precision to vagueness, let’s say — you travel this spectrum. And I want to ask you, from this unique experience that you have, something specific about the data practice. There are two main, let’s say, approaches when you work professionally with data. There is the exploratory part — the discovery part, the science part, where you sit in front of a notebook with a bunch of data and, with some kind of goal, you try to figure something out. That’s very experimental, right? And then on the other hand you have the engineering side, which has to be very strict — we have pipelines, and pipelines are very well defined; we can’t randomly choose steps going from one to the other. And somehow these, let’s say, extremes in how we perceive the work have to work together, right?
And as you’re working on a product where you have this vision of allowing every data practitioner to work — from the software engineer doing production stuff, down to the BI analyst or the data scientist, and the data engineers in between — how do you think of this bridging problem, and how can it be achieved?

Santona Tuli 54:42
Yeah, well, there’s so much in that question. So this is how I see it. I think everything starts with that exploration. Whether you’re doing engineering work, data work, product work, or physics work, the exploration has to be there — and the more you take that away, the more everyone in that workflow, team, however you want to describe it, is disenfranchised of the opportunity to be creative and innovative and really see what’s going on. Depending on the scale, it might be that not everyone can do the exploration, right? Maybe you do the exploration and then decide, okay, these are the things that need to happen, and I’m going to commit to these and delegate. And that is okay. But everything begins with that exploration, right? As a product person, I think the most interesting step is going from that exploration to the spec. This is what we discovered, these are the assumptions that we made — you know a lot more about the product than I do — codifying that, and then saying, okay, this is the spec, the requirements for what we’re going to build. And then another very interesting handoff to me is from PRD to ERD, right? From the product requirements doc to the engineering requirements doc — what are the trade-offs there? So the way I approach it is, all of these are handoffs, sort of changing the lens of looking at things, and they are very interesting to me. More than a challenge, I think of them as an opportunity to learn more and figure stuff out and dig deeper. And with that, I will say it’s also super fun to geek out and just implement something, right?
So for me, the biggest thing is enjoyment, I guess — that’s what’s coming out of what I’m saying. I really like doing this translation, making those different models at different phases. And it’s also really fun to just go and bang your head against a, you know, a segfault — or nowadays it’s a Python traceback or whatever — and just do it. So maybe this is a good thing to retrospect on: just finding enjoyment in all of them helps bridge that gap. That’s one thing, from a very personal point of view. From a team cohesion point of view — tooling can help, certainly, and that’s why we’re building what we’re building in that no-man’s-land between dev and data. But also just collaboration, right? Everyone talking to each other. I wrote about this a few days ago — it’s everyone’s fault, and it’s no one’s fault. As a data person, if I just worry about my stakeholders, my business partners who are downstream of me, and what they need, then I’m doing a disservice to the folks who are upstream of me — the app developers — who might also need something back from me. It’s not just that we have to agree on a contract between what they’re producing and what I’m accepting; they also want their analytics, or they want me to have some flexibility in what I’m expecting from them, and so on and so forth. So that communication and collaboration is, you know, table stakes.

Kostas Pardalis 58:05
Yeah, that’s awesome. Thank you so much. Eric, the microphone is yours again.

Eric Dodds 58:12
Yes. Well, as we like to say, we’re at the buzzer, but there’s time for one more question. And I actually want to return to physics and your time at CERN. I couldn’t help but wonder if there were any things that surprised you in terms of discoveries. As outsiders, it sounds really crazy to us to collide these particles at really high speeds. But as an actual physicist, was there anything that really surprised you as part of that experience of colliding particles?

Santona Tuli 58:46
That’s such a great question. Instead of taking a bunch of time to think back on my whole time there, I will just say the thing that came to mind right away: I was surprised to hear that the whole LHC was shut down because a beaver had cut into the wiring in our tunnels. So maybe there’s a lesson there, right? You make grand, best-laid plans, and then something happens and throws a wrench in it.

Eric Dodds 59:17
Man, is that not the universe saying that it’s really hard to beat nature? It will just kind of do what it does. Wow, that is hilarious. Awesome. Well, this has been such a wonderful time. Thank you so much for coming on the show. We’ve learned a ton.

Santona Tuli 59:42
Thank you so much. Thank you for having me. Nice meeting you both.

Eric Dodds 59:45
It is always a joy to talk to a nuclear physicist about data. And boy, was that a great episode. There’s just something about someone who’s collided particles at insane speeds — it’s just fun to talk to them about almost anything. Santona from Upsolver was just a delightful guest. She is so smart on so many levels, right? I mean, nuclear physics, colliding particles at CERN, working in natural language processing, working as an ML engineer — and she’s so down to earth and approachable, and just really a delight. It was really fun to talk to her. There are so many things about Upsolver that were interesting — focusing on the developer, as opposed to the data engineer, for a pipeline tool was really interesting. But one of the nuggets from the show was how she talked about the differences between working with data as a scientist, a physicist, and working with data at a startup. Because while there are some similarities, there are a whole lot of differences, and her perspective on that was so interesting — interesting because she took learnings from both sides, right? From her perspective, there are things the academic community can learn from startups, and, you know, vice versa. So that was a great discussion.

Kostas Pardalis 1:01:22
Oh, 100%, I totally agree with you. First of all, I think it’s hard to find people who can do something really well — even an average job, to be honest — across the spectrum of disciplines that she has. We’re talking about someone who has gone from crunching numbers about subatomic particles at scale — and when I say at scale, I mean not just all the infrastructure needed there, but the scale of the teams; it’s literally thousands of people who have to cooperate to come up with these things — to doing data science, doing ML work, and becoming a product person. That’s a crazy spectrum of skills and competence that a person needs to develop to be good at all that stuff, right? So first of all, just for that, someone should listen to her, because it’s a very unique experience. At the same time, I think she taught us some things about the differences and the similarities of working with data in different environments. And what is really fascinating, in my opinion, when it comes to data as infrastructure or products, or whatever we want to call it, is that data is a kind of asset where there’s no way you won’t end up with a diverse group of people who need to be there in order to turn it into something valuable. Think of the things we talked about here — from the engineer who builds the actual product, even the front-end engineer — and you have experience with RudderStack, for example — the work that this person is doing actually affects everyone, from marketers to BI people who might not even know they’re in the company, if the company is big enough. You know, they don’t care about that.
And you need to build products that can accommodate all these differences — becoming the glue, in a way, between all these people, to make this whole process of generating value out of the data as robust as possible. And this is not just an engineering problem. It’s not just figuring out the right type of technology — it’s also deeply about how to avoid human problems, because there has to be communication here, right? So figuring all these things out, I think, is what creates so much opportunity in this space. And I’ll keep something that she said: wherever there is challenge, there’s also opportunity. That’s something that’s super important. There are big challenges right now in this space, which means there are also big opportunities. So I would encourage everyone to go and listen to her. It’s a lovely episode, and there are many things to —

Eric Dodds 1:04:50
Definitely — definitely one to check out. Subscribe if you haven’t, tell a friend, and tune in to learn about nuclear physics and data, and we’ll catch you on the next one. We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.