Episode 71:

ETL at the Edges with Jimmy Chan of Dropbase

January 19, 2022

This week on The Data Stack Show, Eric and Kostas chat with Jimmy Chan, Co-Founder and CEO of Dropbase. During the episode, Jimmy discusses how to use data cubes, getting data into a manageable format, and how Dropbase integrates with the rest of the data stack.


Notes:

Highlights from this week’s conversation include:

Jimmy’s career background (3:01)
How to use data cubes (5:52)
What Dropbase is and who it is built for (11:01)
Getting sales and marketing data in usable formats (16:46)
Ensuring data remains flexible and transferable (28:36)
Defining what “offline data” is and how to use it (34:09)
How Dropbase can work with the rest of the data stack (43:30)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Automated transcription – May contain errors

Eric Dodds 00:06
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. Welcome back to The Data Stack Show. If you’re watching this on video, you can see that it’s evening on the East Coast and midday on the West Coast, which is what we get in the winter. We are going to talk with Jimmy from Dropbase. He’s the CEO and co-founder, and it’s a really interesting company. I’m just going to give you a little preview here: you can load data into Dropbase, and it’s database included. So it’s the first company like this we’ve talked to, which is fascinating. I have a lot of questions as sort of the user of this type of product, because I think it would have helped me a lot in the past. I think my main question is going to be about the architectural decision, right? I mean, we talk so much about the cloud data warehouse, data lake, etc., and how that’s the modern architecture. And they chose to do what Jimmy calls batteries included, which is database included. So I just want to know where that came from. And I think his past working with data will inform us on that. But Kostas, tell us what you’re thinking about.

Kostas Pardalis 01:26
I want to learn who the user is. We live in a time where it’s all about the data engineer, data engineering teams, infrastructure for data. And we just assume that everywhere there is a data engineer ready to do, I don’t know, anything you want with your data. Obviously that’s not true. So, from a business perspective, if you think about it, it’s a very underserved segment of the market right now. So I really want to see who the users are, how they feel about this, and how they use it. And the other question has to do with the data sources, because I think that we are going to hear about some more, let’s say, unique data sources that they are encountering, like maybe things like FTP or email and stuff that an engineer usually does not consider a data source. Right? So I think it’s gonna be very interesting.

Eric Dodds 02:25
I am amazed that you did not say you wanted to ask him what database they’re running under the hood? I mean, isn’t that like,

Kostas Pardalis 02:33
what? Well, you started and you were saying that you want to talk about the architecture? So

Eric Dodds 02:38
I’ll leave that today. Okay. All right. Well, I know the questions are gonna come up; we’ll see who gets to that question. Sorry. Let’s jump in and talk with Jimmy. Jimmy, welcome to The Data Stack Show. We’re extremely excited to chat with you. Thanks for having me. Well, give us a little bit about your background and what led you to Dropbase.

Jimmy Chan 02:59
Sounds good. Yeah. So I have been directly in data for the last five years through two of my startups, and more indirectly with data for almost 10 years now. I was an early adopter of Tableau, the BI tool, at one of my first jobs. And this was at the time when the company was still using data cubes. So it’s been quite a while; data management was a little bit harder than it is today. So that’s sort of where I’m coming from. And since then, I’ve always been excited about the data space, and how useful it is to have data that’s accessible, but also to extract insights from it that you can use to make business decisions.

Eric Dodds 03:38
Sure. Okay, so I’d love to talk a little bit, we have so many things to talk about as it relates to drop base. But really quickly, so being an early adopter of Tableau. So when I hear Tableau, and I think a lot of our audience may feel this way is my first question is, was it fast back then? Because you know, today, if you have a big Tableau implementation, it’s like, kind of slow, a little bit cumbersome. I mean, super powerful. But what was it like back then, I mean, because that was kind of a pretty cutting edge BI solution that allowed you to do analytics that were even harder before.

Jimmy Chan 04:13
I think it’s been just as fast as before, but also just as slow at the same time, right? And I’ll explain what I mean by that. BI tools have always been just as fast as your database can produce and transform data for you. And back then, because the company was using data cubes, if I needed to perform a different analysis, it was much, much faster to actually have the cube generated by a data engineer first. So you’d have to submit a ticket, get the cube set up, and then you’d connect your Tableau to it and do your slicing. Tableau was a lot less feature-full back then. It did the basics pretty well. It was still very impressive at the time, because it was very visual, you could drag and drop things, and it was just really, really cool. And I think on that basis they got a lot of customers to sign up. So it was faster and slower: slower because of the data infrastructure that we had back then, but faster because it was less feature-full than it is today. And then today, people just have way more data. And Tableau itself is just a much more complicated product. Super powerful, but it’s very complicated. It does a lot of things, right? And so it slows down a little bit too.

Kostas Pardalis 05:26
One thing before we continue, because I don’t think that we have talked before about what data cubes are. Jimmy, would you like to give like? Yeah,

Jimmy Chan 05:37
Sure. Yeah. Absolutely. I should have mentioned that before. So the simplest way to think about cubes is as an array, sort of like an n-dimensional array, where you just have that array precomputed for the purpose of being able to pull from it faster with a downstream tool, right? The difference with how we do it today is that if we use, say, a data warehouse, or something like ClickHouse, right, they’re column-based analytical databases, and they’re massively parallel, so you can compute columns really quickly, right? But before, it wasn’t like that. We didn’t have Snowflake as well; I think maybe they were alive back then, but it was nowhere near what it is today, right? And so people were still on these old systems. And you had to precompute a lot of these cubes, which were arrays, and you kind of had to settle on them, right? So you have, like, product, and then price, and then time, and that’s your cube. That’s a three-dimensional cube with those three dimensions, right? But if you needed something else, well, you need to just build a new cube, and then store that cube, and then you can pull data from it. So that was just how it worked before. It’s not too different from the concept today of transforming your data and then precomputing some tables. But your tables can still have more flexibility than the precomputed cubes that you used to have before.
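
To make the cube idea concrete, here is a minimal Python sketch of precomputing a three-dimensional cube over product, price, and time. It is only an illustration under stated assumptions: the sample rows, the price bands, the `ALL` marker, and the `build_cube` helper are hypothetical and are not taken from the episode or from any particular database product.

```python
from collections import defaultdict
from itertools import product as cartesian

# Hypothetical fact rows: (product, price_band, month, units_sold)
SALES = [
    ("widget", "low", "2021-01", 10),
    ("widget", "low", "2021-02", 7),
    ("widget", "high", "2021-01", 3),
    ("gadget", "low", "2021-01", 5),
    ("gadget", "high", "2021-02", 8),
]

ALL = "*"  # marker meaning "aggregated over this dimension"


def build_cube(rows):
    """Precompute unit totals for every combination of dimension values,
    including roll-ups where a dimension is replaced by ALL."""
    cube = defaultdict(int)
    for prod, band, month, units in rows:
        # Each row contributes to 2**3 = 8 cells: value or ALL per dimension.
        for key in cartesian((prod, ALL), (band, ALL), (month, ALL)):
            cube[key] += units
    return dict(cube)


# The expensive step runs once, ahead of time (for example on a schedule).
CUBE = build_cube(SALES)

# Downstream queries are then plain lookups rather than scans.
print(CUBE[("widget", ALL, ALL)])     # all widget sales: 20
print(CUBE[(ALL, "low", "2021-01")])  # low-priced sales in January: 15
print(CUBE[(ALL, ALL, ALL)])          # grand total: 33
```

A question that involves a fourth dimension (say, region) cannot be answered from this cube at all. It would have to be redefined and rebuilt, which is the rigidity described above.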

Kostas Pardalis 07:02
Okay, and how was it mainly implemented? How would you implement a data cube? Was it something like a materialized view on a Postgres database, or something else? Like, yeah, no, they

Jimmy Chan 07:14
were using these, I think it was like Oracle databases. I don’t really recall the exact technology they used. But it’s not the kind of tool that you would pick as your first choice today if you were to set up the IT infrastructure, let’s just put it that way. And then within those database tools, they have the concept of a cube sort of embedded, in that you could write some queries, build a cube, and you would schedule the computation of the cubes beforehand, because you had to process them on the actual databases, which were also slower back then.

Kostas Pardalis 07:44
Okay, let’s, that’s super interesting. And what happened to data cubes today, right? Are they still a thing? Or?

Jimmy Chan 07:51
Yeah, I don’t think they’re as popular. People don’t talk about them too much. Companies are not signing up for a data warehouse to be like, yeah, I’m so excited about building cubes today. Today, it’s more about flexibility, right? It’s about being able to quickly get the data you need to do the analysis that you want. So people sign up for a highly scalable data warehouse, where they can just store all the data, and then they can transform data as they need it. And so you create maybe tables that you can then use to perform other transformations, but you don’t have to run these cubes, which are so rigid in structure that if you needed to do something else, you’d have to recompute everything again.

Eric Dodds 08:30
One quick question there. It’s funny, I find myself looking back on technology that, in the world of tech in general, actually isn’t that old, but relative to the tools we have today just seems so antiquated. But I’m just interested to know, did it feel rigid back then? Or was it just kind of, this is convenient, you can build a cube and that makes Tableau more efficient? What did it actually feel like using data cubes back then?

Jimmy Chan 08:59
Yeah, it’s I think, at the time, it felt like, it still felt slow, though. Like, I think as humans, we know, when something feels slow, it just feels like no, because like, anytime you want it like a like slice of data, you had to wait for it. It wasn’t like, it wasn’t like, boom, like you got a table. And that’s it. So it felt slow. But at the same time, it felt like, well, I mean, how else are you going to do this? Right? Yeah. So so that’s that’s generally how I remember it feeling like back then. Yeah, I just

Kostas Pardalis 09:27
keep thinking of what the differences are, and what there is in common, between the concept of a data cube and what we try to do today using something like dbt, right? Because in the end, dbt defines a table, that’s what we get at the end, but couldn’t it be like a data cube, right? It is like a multidimensional data set that we are going to use to do whatever we want to do there. Do you see anything in common? There

Jimmy Chan 10:00
are parallels, I’d say there are parallels. Just the concept of precomputing data for the purpose of accessing it faster is a very common thing to do, right? It’s like when you think about computer systems too, right? You have storage and memory; you kind of always want to have something quickly available for use. And so the principle is the same. But in practice, they’re a little bit different, in the sense that even when you use dbt to precompute your data, the results still remain in database tables on that same warehouse, right? Whereas before, you could have like a 10-dimensional cube, and you had to predefine it so explicitly. Yeah. And you had to run it on your entire data to be able to use it. So that was really painful.

Kostas Pardalis 10:47
Okay, okay. I think we had enough with the data cubes. So you were asking something, Eric, and I stopped you to ask about the data cubes to

Eric Dodds 10:58
Oh, it’s great. I think one thing we’ve learned on the show is that we can learn from the past, which is great. And we hadn’t talked about data cubes yet, so I’m so glad you brought it up. Enough of the past. Tell us about Dropbase, and what the product does and who it’s built for.

Jimmy Chan 11:14
Sure. Let’s jump from the past to the future. Yeah, so Dropbase is an end-to-end platform to automatically import, clean, and centralize your data in a database. And this is database included, so we’re basically a batteries-included solution. We make it very easy for teams to automate their data work and to spin up their infrastructure in a very simple and intuitive way, very quickly, so that even if you don’t have a full-fledged data engineering or technical team, you can still access some of these tools that are beneficial for companies, such as databases and quick data pipelines. So that’s what we do today, which

Eric Dodds 11:55
is very cool. And I’m so interested to know: lots of ink has been spilled about the modern data warehouse, data lakes, that infrastructure sort of standing on its own. What were the variables or observations or needs that you saw that led you to build a product as database included? I’d just love to know the thought process behind that.

Jimmy Chan 12:20
Sure, I think we can answer that question if we go back to the kinds of problems that we solve, the kinds of problems that we observed from our user base, right? At a high level, there are three problems. There’s, how do we democratize data? How do we automate their operations? And how do we help them set up infrastructure, right? On democratizing data: people are outgrowing their spreadsheets, right? People start out their analysis with Google Sheets or Excel; maybe they export some CSVs from some incompatible system. And then as their needs grow, they can’t use these tools anymore. And so that’s kind of the basis we started from, right? Users who are in that world. And then when these people deal with data import, people are sending them data through emails or through batch exports, maybe they’re connecting to an SFTP server, and they end up doing a lot of repetitive data cleaning work. So that is downloading an attachment from an email, then uploading it to some new system, cleaning it, and then maybe moving it to a database eventually, right? And what happens is, when an analyst, for example, is building some data cleaning steps or a data pipeline in their spreadsheet, it doesn’t really carry over to scalable data pipelines that you can then use over and over again. And because of that, a lot of non-technical teams end up being paralyzed, right? They just can’t really do things without needing an engineer or maybe somebody else to help them set up a database. And even if they’ve set up a database, well, how do they move this data to it easily without writing their own scripts, and with all the data cleaning steps that they’ve added? And so if you look at these core problems: people are outgrowing their spreadsheets, and people have to do repetitive data cleaning work every time they deal with a spreadsheet. 
And then there’s the fact that they can’t themselves spin up data pipelines or data infrastructure. Those are sort of the core problems we see, and we say, well, if we were to give them a solution for this, it’s going to have to be batteries included. It’s going to have to be something where they can create an account, create a workspace, upload a CSV or an Excel file, and immediately have it in a database that they can then connect to a BI tool, for example, or any other tool that connects to a database, right? And so that’s sort of how we look at the evolution of observing the problems and then saying, okay, we must give them these tools. So that’s a really core part. The other part is that I think this group of users is kind of like the forgotten users, because a lot of tooling and products today focus on people who we assume already have a database. And I think with big milestone events like Snowflake going public, it’s going to drive this sort of move where we kind of leapfrog from spreadsheets almost straight to data warehouses, similar to what we saw with mobile phones, right? The technology just makes sense. Why are we still going in such small steps? We can just sort of leapfrog. And so we see that a big portion of the market, with these larger and larger spreadsheets and CSV exports, is just going to need a database. And we want to be there for them.

Eric Dodds 15:42
Yeah, one question. And I know, Kostas, I can see in your mind you have technical questions about the database. But when you talk about outgrowing spreadsheets, I think about two vectors there, and maybe there are more, but I’m just interested in this. These are the two vectors that I’ve experienced in my past, going through the exact sort of lifecycle of spreadsheets that you’re talking about. One is complexity, right? So I’m exporting marketing data, sales data, transactional data, etc. And I’m getting really good at VLOOKUPs. And it’s like, okay, this is unwieldy, right? I mean, the way it played out in my past is like, okay, Monday morning, the first four hours are running all the lookups and everything to get the numbers from last week, or whatever. The other is size. And I know that these are related, right? But you have to have a pretty powerful machine to run hundreds of thousands of rows in Excel. Google Sheets is getting better, but these things choke when you get to a certain amount of data. And actually, now, with all the data that companies are collecting, it’s not that much data. So how do you see those two vectors interacting? Are there more that sort of force people past a point of, like, this isn’t working anymore?

Jimmy Chan 17:09
Yeah, absolutely. I think those two vectors are quite accurate ones; there might be a couple more. So think about the VLOOKUP, the complexity of it, and also the requirement to have a big machine or a lot of memory in your local computer to run it. And then when you draw a parallel with how you do it in a database, it’s just like a left join, right? Today, if you had two tables in Snowflake and you did a left join, you could do it on a million rows in seconds, right? So then the gap can be closed through user experience. If we could just build a function or a UI component that helps a user who is familiar with the concept of VLOOKUP perform a left join in a database, then maybe we could bridge the gap, right? So those two vectors are really important. I’d say the other one is a bit more about scalability and repeatability, because a lot of the time you end up just doing this over and over again, right? With a VLOOKUP, let’s say that either the reference table is updated or the core table is updated. In either case, every time you get an update, especially to the reference table, the one that tells you, okay, Apple is in tech and Citi is in finance, well, you have to rerun the whole thing again. And that tends to be quite manual: you open your spreadsheets, you do the VLOOKUP, and then it’s there, right? With databases, and with tools like Dropbase or other tools, you can just automate that process and make it more scalable and repeatable. Yeah,
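
The VLOOKUP-to-left-join parallel can be sketched with Python’s built-in `sqlite3` module. This is a minimal illustration, not how Dropbase or Snowflake does it: the `holdings` and `sectors` tables, their columns, and the sample values are hypothetical, and SQLite merely stands in for a real warehouse.

```python
import sqlite3

# In-memory database standing in for a real warehouse (an assumption).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE holdings (company TEXT, amount REAL);              -- core table
    CREATE TABLE sectors (company TEXT PRIMARY KEY, sector TEXT);   -- reference table
    INSERT INTO holdings VALUES ('Apple', 120.0), ('Citi', 80.0), ('Acme', 15.0);
    INSERT INTO sectors VALUES ('Apple', 'tech'), ('Citi', 'finance');
    """
)

# The database equivalent of VLOOKUP(company, sectors, 2, FALSE):
# keep every row of the core table and attach the matching sector, if any.
LOOKUP_SQL = """
    SELECT h.company, h.amount, s.sector
    FROM holdings AS h
    LEFT JOIN sectors AS s ON s.company = h.company
    ORDER BY h.company
"""


def run_lookup(connection):
    """Run the saved join; it can be repeated whenever either table changes."""
    return connection.execute(LOOKUP_SQL).fetchall()


rows = run_lookup(conn)
print(rows)  # Acme has no match, so its sector is None (like #N/A in a sheet)

# The reference table gets an update: rerun the same query, no manual steps.
conn.execute("INSERT INTO sectors VALUES ('Acme', 'manufacturing')")
rows_after = run_lookup(conn)
print(rows_after)
```

Because the lookup is a stored query rather than a formula dragged down a column, rerunning it after an update to the reference table is a single call that can be scheduled, which is the repeatability point made above.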

Eric Dodds 18:43
I think, yeah, I think the other major thing, when you think about spreadsheets, is human error, right? I mean, data is messy in general. And when you think about combining all these different spreadsheets, and then trying to use VLOOKUPs, and macros if you’re getting really fancy, to try to normalize all this stuff, it’s like, I mean, someone’s going to fat-finger part of the equation, part of the formula, at some point. And that’s obviously very painful, especially when it takes like 10 minutes for everything around

Kostas Pardalis 19:17
them. But Eric, I can’t stop thinking, while you’re talking about spreadsheets, like, can you think what is going to happen to our civilization if suddenly, like tomorrow, Excel disappears?

Jimmy Chan 19:30
Oh, no, it would be a disaster. You know, spreadsheets still hold a very important part of the economy together in some way. If you think about it that way, there are things that spreadsheets are just very good at. And if it wasn’t that they don’t perform at scale, people would just still use spreadsheets. They’re very powerful. But yeah, the whole world would fall apart, literally, if spreadsheets stopped working today. You’ve heard the horror stories of big financial models just built on Excel and being run on Excel.

Eric Dodds 20:03
Well, I think, you know, one thing, Kostas, to that point: I remember a couple years ago, we were trying to solve this, you know, sort of in marketing, you want to tag all your links. And so, this was a huge project for this massive company, and they needed all these permissions and everything. And so we built a Google Sheet to do all this. We had custom scripts running in the Google Sheet; we were hashing values with, like, you know, MD5 in a custom Google script to make the string shorter, and all that sort of stuff. And I remember showing it to my friend who’s a software engineer, and he’s like, this is software. It’s like, you shouldn’t, he’s like, this is so brittle.

Kostas Pardalis 20:44
That’s true, Eric. But on the other hand, if you think about it, it’s amazing how approachable it has made software development, right? You have all these people out there who are actually not developers, but they can still develop automations for their needs, right? Which is amazing. I don’t know how much of it is just that it’s been out there, like, forever, and, you know, there’s a lot of training and all that stuff. But as an interface with the machine, and as a way to program the machine, I think it’s probably the most successful interface so far. Now, that this doesn’t scale, that’s a different conversation, right? But

Eric Dodds 21:32
and that particular project did not scale. It got really slow really quickly.

Jimmy Chan 21:37
I mean, yeah, think about it. Let’s say you had a bunch of different VLOOKUPs, or other sorts of transformations, in a collection of spreadsheets that collectively are like gigabytes of data. And now you’re told to scale that thing. So the first thing you do is probably contact one of your, you know, fellow engineers, and you say, hey, can you scale this up, right? And then they’ll look at you and be like, I have no idea what you did here, right? Ideally, it’s something where the person who was building that spreadsheet can somehow record those steps, and then we can take the steps and deploy them at scale, and then maybe it could work. But today, they’re not built that way. Yeah, they’re not built to just transfer easily to code. Yep.

Kostas Pardalis 22:25
So Jimmy, can you give us a small tour of the experience with Dropbase, for a person who gets a file through an email and wants to use the product?

Jimmy Chan 22:38
Yeah, absolutely. So we have this new feature coming out that’s called Dropmail. And it’s probably going to be the simplest way to get data from an email straight to your database, right? So all you do is open up your email, type in a special email address, a special Dropmail address, add a CSV attachment, and you just send it. And on the other side, using the Dropbase dashboard, you can set up a sort of pipeline where you grab that data and apply some cleaning steps; we can even automatically map it so that it fits the database schema. And then we just automatically run it from there. So the experience is quite magical, right? You just send an email, and then if autorun is enabled, it’s straight in your database. So if you have all your downstream tools connected to that, you can imagine that you get to automate the whole process of downloading that file, doing what you need to do with it, and then somehow writing a script to inject that data into your database, right? And a use case where this becomes helpful: we have users that are, let’s say, in the e-commerce space, right, and they get shipment updates from their manufacturers, sometimes every day. And, you know, guess what, those shipment updates are going to be Excel attachments to emails. And so now what you can do is set up a rule on your email and say, every time it comes from my manufacturer A, I want you to take this data to this pipeline A, which goes to table A in your database, straight in, right? Now, of course, your data has to be formatted in a particular way, so like a proper CSV file, and you can pre-build some cleaning steps for it. And then that’s it, it just goes straight in. Sometimes the data doesn’t come from an email; sometimes you just export it from a system. 
Typically, those systems tend to be more incompatible, like there’s no API to connect to them. Or sometimes it’s just privacy or security issues, and they have to take snapshots of it as CSV. And so it’s the same thing: you want to upload that data, you want to clean it up, and you want it in your database.
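
The email-to-table flow described above (a sender rule, a CSV attachment, rows ready to load) can be sketched with Python’s standard `email` and `csv` modules. This is a hedged illustration of the general idea, not Dropbase’s implementation: the addresses, the `ROUTES` rule table, the table name `shipments_a`, and the helper functions are all hypothetical.

```python
import csv
import io
from email.message import EmailMessage
from email.utils import parseaddr

# Hypothetical routing rule: sender address -> destination table.
ROUTES = {"updates@manufacturer-a.example": "shipments_a"}


def build_sample_email():
    """Create an in-memory email with a CSV attachment (stand-in for a real inbox)."""
    msg = EmailMessage()
    msg["From"] = "Manufacturer A <updates@manufacturer-a.example>"
    msg["To"] = "inbox@pipeline.example"
    msg["Subject"] = "Daily shipment update"
    msg.set_content("See attached.")
    csv_text = "order_id,status\n1001,shipped\n1002,delayed\n"
    msg.add_attachment(
        csv_text.encode("utf-8"),
        maintype="text",
        subtype="csv",
        filename="shipments.csv",
    )
    return msg


def route(msg):
    """Pick the destination table from the sender, or None if no rule matches."""
    sender = parseaddr(str(msg["From"]))[1].lower()
    return ROUTES.get(sender)


def extract_csv_rows(msg):
    """Parse every .csv attachment into a list of dicts keyed by header name."""
    rows = []
    for part in msg.iter_attachments():
        name = part.get_filename() or ""
        if not name.lower().endswith(".csv"):
            continue
        text = part.get_payload(decode=True).decode("utf-8")
        rows.extend(csv.DictReader(io.StringIO(text)))
    return rows


message = build_sample_email()
table = route(message)
records = extract_csv_rows(message)
print(table, records)
```

In a real pipeline the cleaning steps and the database load would follow `extract_csv_rows`. Excel attachments would need a spreadsheet parser, which the standard library does not provide.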

Kostas Pardalis 24:48
Okay, this is very interesting. And outside of email, which I guess is a very common, let’s say, channel where data is coming in, what other channels have you seen out there, or communication methods or whatever, that we wouldn’t expect to see as a method of exchanging data?

Jimmy Chan 25:08
Well, I mean, I think there’s this thing called EDI that big companies still use. I’m not sure if you’re familiar with the concept; I’m kind of new to it myself. It’s some sort of electronic data exchange protocol that was used before APIs were a thing. And it was used by big companies to exchange data with each other in a way that was, you know, more standard, more compatible. So those channels are actually pretty big today, surprisingly big, because the big companies operate them, but we don’t hear too much about them. I only heard about this because we had a user reach out, like, yeah, we’re in the insurance industry, and EDI is a big thing there, apparently, right? And so they have their own set of protocols and ways to make it compatible. They’re very different from APIs, but the underlying concept is the same. It’s like, okay, how do I connect to a data source that comes from an EDI product so I can pull data in? So that’d be a new and unusual channel where data comes in. And then you have your, you know, your usual suspects: SFTPs, cloud storage, you have your APIs, like pulling data directly from Shopify or QuickBooks or something like that. And then the offline sources: CSVs, Excel files, emails. And then, you know, you can build a whole universe of sources

Kostas Pardalis 26:29
like that. Yeah, that’s pretty cool. And what kind of file types or serialization types do you support? I mean, you mentioned CSV. So I guess that’s like a very common one. Is there something else that you see like being used out there outside of CSV and Excel file,

Jimmy Chan 26:45
CSVs, predominantly; they’re a pretty standard way to exchange data. Excel files as well, and Excel derivatives, you have like Excel workbooks. And then there are other ones that we don’t do today, but we could do. It could be JSON exports, XML sometimes, and then some of the OpenDocument formats. There are still people using them, but they’re not as common as CSV and Excel files. I’d say for offline, like for flat files, CSVs and Excel files would be probably 80 to 90% of the offline data. Yeah.

Kostas Pardalis 27:21
You mentioned two very interesting terms that usually conflict with each other in reality. One is automation. So you said, for example, you can forward the email, and if you have automation, like magically you will see the data in your database. And the other is data quality. And I’m wondering, because especially when you are dealing with CSV, you don’t have that much information about data types, right? Actually, you don’t have any information about data types; everything is a string. On the other hand, of course, you have a database, which is a strongly typed system. So how does this work? What’s the magic there? What do you do there? Sure,

Jimmy Chan 28:05
yeah, there are a few things we do to ensure that we can still ingest the data, right? Given that the database is strongly typed, we must ensure that the data that comes in fits that schema, but without having to explain to the user all these things about types, right? And so what we do is a first pass of automatic inference of data types. So we try to cast things. If we see strings that could look like dates, we attempt them as dates, right? And so we help users with some of this stuff. And then with integers and floating points or decimals, you know, that’s a little bit easier. So that’s one level of assistance that we provide our users. The second level of assistance is an explicit transformation at the moment of ingestion, right? So they can say, look, I want this to be a date. And then you can click and add a step that says, okay, turn this into a date type. And then we will sort of force it to a date type, and we’ll attempt to load that to the database. So if it’s a new table that you’re creating from your CSV file, then that first set of types establishes the table in your database. And then the next time you’re trying to append more data to the same table through a new CSV file, we just automatically do all the mapping for you, and you just click Load to Database and that’s done. Now, there are cases where, let’s say, you have like 1,000 rows, and almost all the rows are properly cast as dates, for example, but there’s one row where it’s just an ambiguous date or a messed-up date. Those cause problems. And so the way we’re thinking of addressing those is to provide a summary of all the rows of data that weren’t compatible with the database, and then provide the user with ways to transform the data so that it can be successfully ingested into the database. 
But this is one of the key challenges we help solve, both from a technical side and from a user experience side. Because if you’re coming from Excel or from CSV, there are no types, right? And the user might not know that the database has an expected type. So we have to abstract some of this stuff away for them through the user experience.
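
The two-step approach described above (infer a column type on first load, then cast later loads against the established type and summarize the rows that do not fit) might look roughly like the Python sketch below. It is an assumption-laden illustration rather than Dropbase’s logic: the type names, the single ISO date format, and the helpers `infer_type` and `load_column` are hypothetical.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation


def to_date(text):
    # Assumes ISO dates only; real data would need more formats.
    return datetime.strptime(text, "%Y-%m-%d").date()


# Candidate types, tried from most to least specific.
CASTERS = [("integer", int), ("decimal", Decimal), ("date", to_date)]


def try_cast(caster, text):
    """Return (True, value) if the string casts cleanly, else (False, None)."""
    try:
        return True, caster(text)
    except (ValueError, InvalidOperation):
        return False, None


def infer_type(values):
    """First pass: pick the first type that every value in the column fits."""
    for name, caster in CASTERS:
        if all(try_cast(caster, v)[0] for v in values):
            return name
    return "text"


def load_column(values, type_name):
    """Later loads: cast against the established type and collect rejects."""
    caster = dict(CASTERS)[type_name]
    good, rejected = [], []
    for row_number, text in enumerate(values, start=1):
        ok, value = try_cast(caster, text)
        if ok:
            good.append(value)
        else:
            rejected.append((row_number, text))
    return good, rejected


first_file = ["2022-01-19", "2022-01-20"]
column_type = infer_type(first_file)  # establishes the column as a date

second_file = ["2022-01-21", "19/01/22", "2022-01-23"]
loaded, problems = load_column(second_file, column_type)
print(column_type, loaded, problems)  # problems lists the row to show the user
```

The `problems` list plays the role of the summary of incompatible rows: each entry carries the row number and the raw text, so a user can decide how to transform it instead of having the load silently fail or coerce the value.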

Kostas Pardalis 30:24
Yeah, 100%. I mean, it’s a very interesting, very hard problem at the end, because you can infer some stuff, but you cannot infer everything. I think one of the most annoying types is Boolean, because people can represent a Boolean value in so many different ways. You have true and false, you have yes or no, you have zero and one, and there are times when they just merge all of these together. And of course, when a human reads an Excel file, it looks fine. You can interpret the semantics around the values, right? But that’s not exactly true when it comes to the database system. And yeah, it’s very interesting, because there are two things there. One is you have the data that you cannot infer, and you need to keep it somewhere so someone can go and transform it. But also, what I have seen is that you can be so aggressive with trying to adapt everything and auto-cast, let’s say, that it’s very easy to end up in a situation with your data set being just strings at the end, right? Which...
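
The Boolean problem Kostas raises can be sketched in a few lines of Python. The token sets and function names below are assumptions for illustration, not any product’s actual rules. The sketch maps the common spellings to `True` or `False` and keeps anything it cannot interpret in a separate list rather than guessing.

```python
# Assumed spellings; a real tool would likely make these configurable.
TRUE_TOKENS = {"true", "yes", "y", "1"}
FALSE_TOKENS = {"false", "no", "n", "0"}


def to_bool(raw):
    """Map one spreadsheet cell to True or False, or raise ValueError."""
    token = raw.strip().lower()
    if token in TRUE_TOKENS:
        return True
    if token in FALSE_TOKENS:
        return False
    raise ValueError(f"not a recognised Boolean: {raw!r}")


def cast_bool_column(values):
    """Return (converted, rejected).

    `converted` has one entry per input row, with None where the cast
    failed; `rejected` holds (row_number, raw_value) pairs for review.
    """
    converted, rejected = [], []
    for row_number, raw in enumerate(values):
        try:
            converted.append(to_bool(raw))
        except ValueError:
            converted.append(None)
            rejected.append((row_number, raw))
    return converted, rejected
```

Keeping the rejected cells, instead of falling back to text for the whole column, is one way to avoid the "everything ends up a string" outcome described above.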

Jimmy Chan 31:37
Yeah, yeah, absolutely. So that is the challenge that we’ll have to solve: how to, over time, get better and better at accurately inferring types in a way that is aligned with the user’s intention of what they want, right? What we absolutely don’t want to do is lose precision in some of the data. Like, if they have data that comes as decimals, like floating points, like, you know, 35.54, you definitely don’t want to mess with that. You don’t want to say, oh, just 35, and forget about the decimal part. So there’s things that we can be very careful about. But then, yeah, for the other problem, it is just about, over time, building a way to understand the user’s intention, and then maybe provide them a choice, something more explicit, something informed, and something that they can take an explicit action on to make sure it fits.
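
To make the precision point concrete, here is a small Python sketch (the function name is hypothetical) that only narrows a value to an integer when nothing numeric is lost, and otherwise keeps an exact `Decimal`, so 35.54 never quietly becomes 35.

```python
from decimal import Decimal, InvalidOperation


def to_number(raw):
    """Cast a cell to int only when no value is lost; else keep a Decimal."""
    try:
        value = Decimal(raw.strip())
    except InvalidOperation:
        raise ValueError(f"not a number: {raw!r}") from None
    if not value.is_finite():
        # Reject NaN and Infinity rather than passing them through.
        raise ValueError(f"not a finite number: {raw!r}")
    if value == value.to_integral_value():
        return int(value)
    return value
```

Using `Decimal` rather than `float` here is a design choice for the sketch: it preserves the digits exactly as they appeared in the file.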

Kostas Pardalis 32:28
Yeah. In your opinion, what’s the most annoying data type to work with?

Jimmy Chan 32:32
Yeah, you know, Booleans, yeah, they can be difficult. I think because they’re difficult, they just end up as text, and then you’re going to have to figure out something, or you can transform it later, from, like, true to one, down the road. I’d say it’s when you get more advanced with your data, like if you’re thinking about location data, and then the different ways to store times and time zones. Those become a little bit challenging. So today, we do deal with date types pretty well. But for location data, yeah, it’s just text today. We can do it, but we think it adds more complexity for users to have location data be represented as location data explicitly.

Kostas Pardalis 33:18
Yeah, yeah. Time is just a pain in the ass. Absolutely. Like, so...

Eric Dodds 33:26
Amen, because that applies not only to databases, but just to life.

Kostas Pardalis 33:34
Yeah, of course. I mean, data is a projection of reality at the end. I think the most interesting thing time has to say is about human nature and human communication. Because at the end, it’s that we just cannot agree on how we want to represent the stupid thing which, you know, governs our life. It’s crazy. I mean, if you see time manipulation libraries, the amount of work that people have put towards building these libraries is just crazy. And I’m pretty sure that someone outside of software engineering would think, okay, it’s just time. We’ve had clocks since forever, right? How hard can it be to get it right?

Jimmy Chan 34:21
Yeah, no, totally true. This is something where people more on the technical side are like, you know, this is important, and this is difficult. And then for everybody else, it’s kind of like, I mean, it’s just time, can you please add my time to the spreadsheet or to the database? And I’m like, yeah, I wish it was that easy.
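
A short Python sketch of why "it’s just time" is harder than it sounds. The sample values and the fixed UTC offsets are assumptions chosen for the example (real zones also shift with daylight saving time). The same spreadsheet cell can be read as two different dates, and the same wall-clock time can mean two different instants.

```python
from datetime import datetime, timedelta, timezone

# 1. An ambiguous date: month-first or day-first?
cell = "03/04/2022"
us_reading = datetime.strptime(cell, "%m/%d/%Y").date()  # 4 March 2022
eu_reading = datetime.strptime(cell, "%d/%m/%Y").date()  # 3 April 2022

# 2. A timestamp with no time zone attached.
raw = "2022-01-19 09:00"
naive = datetime.strptime(raw, "%Y-%m-%d %H:%M")

# Fixed offsets, assumed for the sketch.
eastern = timezone(timedelta(hours=-5))
pacific = timezone(timedelta(hours=-8))

as_eastern = naive.replace(tzinfo=eastern)
as_pacific = naive.replace(tzinfo=pacific)

# Same text in the file, but the two readings are three hours apart.
gap = as_pacific - as_eastern
```

Neither reading is wrong on its own; the file simply does not carry enough information, which is why an explicit choice by the user is often the only safe option.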

Eric Dodds 34:37
Yeah. Question. One thing I’d just like to drill in on briefly is you used the term offline data, and we haven’t used that term a lot on the show. Could you give us a quick definition for our listeners? Because, you know, I think sometimes it can just refer to data that is in a CSV in an email. But it can also refer to types of data that, you know, sort of don’t emanate from the cloud originally, which can be some of the most valuable data.

Jimmy Chan 35:10
We really mean a combination of those, just any data. So offline, for us, is just any data that is not online. But it’s funny because, in theory, you extract a CSV from an online system, presumably, right? But we really just mean files, CSV files, Excel files, and also data that’s sitting locally on your machine.

Eric Dodds 35:36
Yep. For sure. And so can you give us just a couple of the main use cases? Right, so when we think about offline data, what are your customers using Dropbase for in terms of types of data? You know, who are the users of Dropbase, and how are they using it?

Jimmy Chan 35:55
Yeah, e-commerce is always a really good example. It’s a market, an industry that’s really growing a lot, and people use a variety of tools to do this, right? So one of the things is, for example, similar to the example before, you have shipping companies sending you shipment updates, and they almost always come in offline files, or flat files, you know, Excel or CSV files. But then you also have companies where, when their customers want to use the product, they first need to onboard a lot of this data to the database, right? And so they will export data from another system, they’ll convert it to a CSV, and now they want to be able to ingest that quickly and repeatedly. And so those use cases tend to be the ones where, let’s say, you’re a company that builds software for insurance brokers, right? Every broker has their list of customers. And guess what, they’re usually in an Excel sheet or CSV file. Or maybe it’s in some system that is a pretty old-school system, right? And so now, if this company wants to serve those customers, they need a scalable way to help all their customers quickly get that data into the system, right? And so the idea would be that they can bring it in, they can clean it, and then they can have it in a database so that their product, which is on top of that database, can then query it and maybe show some dashboards or visualizations or some analytics for them. So those are the use cases, yeah.

Eric Dodds 37:36
You know, it’s interesting. As you talk about Dropbase, it’s maybe one of the first times I’ve considered a product where, when I think about the users, it’s, you know, sort of SMB or enterprise. Like the non-technical SMB, where maybe I’m running e-commerce and, of course, it doesn’t make sense for me to have data engineers on staff, or an enterprise where I’m dealing with lots of offline data. And that’s super interesting. Would you say that’s true of your users?

Jimmy Chan 38:12
Yeah, that’s pretty accurate. So the SMB angle certainly is, you know, they’re starting with spreadsheets, and then those businesses get bigger, but they still know that they want to save time. They still know they want to get this data connected, because there’s now a lot of high-tech, you know, nice SaaS B2B tools that would be marketed to a lot of these smaller companies. They want to be more tech-oriented. But then the first step to use that tool is like, hey, step number one, connect to database. And they’re like, okay, you lost me there, right? I’m not sure how that’s going to work for me.

Eric Dodds 38:46
Yeah, for sure. It’s super interesting, because, I mean, in e-commerce, we’re just, you know, sort of in the early innings in terms of the growth there. But if you think about someone who’s running, you know, a successful e-commerce store on Shopify, the machines are all talking together, right? I mean, most of this stuff is either connected, or it’s completely disaggregated and you get it in a CSV. There’s no in between, which is fascinating. So yeah. So that’s sort...

Jimmy Chan 39:13
That’s sort of the concept of ETL at the edges, in the sense that as a big company, you know, today you can sign up for data integration tools, right? And you can connect to different sources. And so let’s say you do ETL or ELT, right? But then there’s another set of customers, typically, I think, smaller companies who are not super tech-oriented. They may not even have databases. And so they still need a tool to get on that path of having a data stack to begin with, right? And so that’s where you can sort of extend the idea of ETL, but extend it at the edges, because there’s still a lot of data that is just trapped in offline files or systems. And so how do we make use of that data? How do we turn that data online, so it can be useful for that company or for that business?

Kostas Pardalis 40:00
Jimmy, I have a question that has to do a little bit with the architecture of the product itself. In the product experience, the database is included, right? How do you do that? What kind of technologies are you using? And what was the reason that you decided to do it like this, instead of connecting to all the different available databases?

Jimmy Chan 40:22
So for our initial version of Dropbase, because we were kind of just proving out the concepts, we built it on a Postgres database. Again, you know, open source, a fairly powerful, fairly flexible piece of technology. So we started that way, and it was, okay, let’s move that data to Postgres databases. But then we realized that at scale, when you hit millions of rows, you know, a transactional database isn’t as good. You definitely either need to add extensions to that Postgres, or you just need to use a column-based database or a data warehouse. So our new version of Dropbase, we built it straight on Snowflake. With that, we have the benefits of security, of near-infinite scale, and of hyper-fast querying, right? Like, millions of rows in a couple of seconds. And so, given that we expect our users to continuously import more and more data over time, data that is generated in these offline files or systems, that data set would become bigger and bigger over time. And so having something like a data warehouse is super neat. We’re also looking into other kinds of data storage systems, or other databases or data warehouses, that we could use to do this in a way that’s super scalable, super accessible, but also affordable for our users, right?

Kostas Pardalis 41:57
Yeah, that’s super interesting. So, okay, dealing with Snowflake, I mean, there’s still some configuration that you need to do, right? As a user, you have to select your data warehouse, you have to select the size. You have these kinds of parameters there. Do you handle that for your customers, or do they have to figure this out?

Jimmy Chan 42:23
We handle that for the user. So the user signs up for Dropbase, and they get their own private database instance. And then we make certain opinionated choices about how we configure that database for them, so that from their side, all they’re doing is just creating a Dropbase account, and they immediately get access to that database. So they get credentials to the database. And we mirror our permissions in Dropbase to those permissions in Snowflake, so that if you are an owner of the workspace, you are also the owner of that Snowflake database, from a credentials perspective, obviously. And this allows us to do some pretty neat stuff, right? We can help users manage access to the database. That’s the first thing. And the second thing, which is one of the principles that, you know, we really like, is that a user should be able to access and control their own data. Even though we’re managing the database for them, they can at any point access their data, because they have the credentials to do it.
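
The idea of mirroring workspace permissions into the warehouse could look roughly like the Python sketch below. This is purely illustrative: the workspace role names (`owner`, `editor`, `viewer`), the privilege mapping, the role naming scheme, and the use of the `PUBLIC` schema are all assumptions, and the conversation does not describe how Dropbase actually does it. The sketch only builds Snowflake-style `GRANT` statements as strings and does not connect to anything.

```python
import re

# Hypothetical mapping from a workspace role to warehouse privileges.
PRIVILEGES = {
    "owner": ["OWNERSHIP ON DATABASE {db}"],
    "editor": [
        "USAGE ON DATABASE {db}",
        "USAGE ON SCHEMA {db}.PUBLIC",
        "SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA {db}.PUBLIC",
    ],
    "viewer": [
        "USAGE ON DATABASE {db}",
        "USAGE ON SCHEMA {db}.PUBLIC",
        "SELECT ON ALL TABLES IN SCHEMA {db}.PUBLIC",
    ],
}

# Only plain identifiers are accepted, so names cannot smuggle in SQL.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def grants_for(member, workspace_role, database):
    """Build the SQL statements that mirror one workspace membership."""
    if workspace_role not in PRIVILEGES:
        raise ValueError(f"unknown workspace role: {workspace_role!r}")
    for name in (member, database):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    role = f"{database}_{workspace_role}".upper()
    statements = [f"CREATE ROLE IF NOT EXISTS {role}"]
    statements += [
        f"GRANT {privilege.format(db=database)} TO ROLE {role}"
        for privilege in PRIVILEGES[workspace_role]
    ]
    statements.append(f"GRANT ROLE {role} TO USER {member}")
    return statements
```

Because access is expressed as ordinary warehouse roles, the same credentials work from any tool that can connect to the database, which fits the principle that users keep direct access to their own data.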

Kostas Pardalis 43:25
Yeah, that’s, that’s really interesting. And I guess, maybe I had a question I wanted to ask you, like, if you have seen like your system, installed together with traditional data warehouses out there. So like a company might be using, like both, for example.

Jimmy Chan 43:41
Yeah, so not currently. But we expect that to happen over time, right? There are two ways that this can evolve. A user starts with Dropbase, and if they’re sort of on the smaller side, they could grow with Dropbase. And eventually, they can say, oh, you know what, this is our data warehouse now. We’re going to buy other tools to maybe do data integration, maybe do BI, but then I can just use Dropbase to do all of this. So that’s certainly one way it can go, right? And the other way it can go is you start with Dropbase, and then that is just one of your points of data integration. So that becomes one of your sources for a larger stack that you would have built or that you already have at your company.

Kostas Pardalis 44:27
Mm hmm. That’s pretty neat. And what’s your experience so far of using Snowflake as, let’s say, a component of your product? I think we had another company that we talked to in the past, in another episode, where they did something similar. They had actually built a lot of logic on top of Snowflake. They pretty much implemented algorithms on top of that to do their stuff. So it was very interesting. And it seems like it becomes some kind of pattern there, and it makes sense also for Snowflake, right? What they try to do is move away from being a data warehouse and become a data platform where you can build products on top of it. So how’s your experience so far?

Jimmy Chan 45:14
There are many trends that lead to that. The first one is, you’re right, it’s just Snowflake’s desire to be a database or data warehouse for more things and more companies out there. So Snowflake has actually started a sort of startup program where they work with startups, providing them with some credits and some expertise, some time with their engineers to discuss how to build products on Snowflake. And by that they really mean use Snowflake as the data layer, as the data store, with the startup’s product sitting on top of it and having a pretty tight integration to it. So certainly that is one of the appeals of this: Snowflake wants to do this, and so they’re investing in it. And the second one is that Snowflake as a data warehouse is fairly mature, it’s fairly sophisticated. And it has a really good, let’s call it API, where you can basically interact with the underlying database programmatically. You can do a lot of stuff. You can generate permissions on the fly, you can generate tables and connect permissions to those tables in different ways. And so it’s fairly feature rich, fairly programmatic, right? So as someone building a product, you have a lot of flexibility. We’ve seen other databases, I mean, we constantly do research about other databases that we can integrate with, and their API, sort of their developer ecosystem, just isn’t quite there yet. But we’re always keeping an eye on this just so we can offer more choice, I think. And also, to address a previous point in terms of the database-included approach: certainly for the smaller companies, we want to provide them the database included. But for the other extreme, which is companies that are maybe a bit bigger,
and maybe they already have the infrastructure set up, we’re going to have a sort of BYODB, like a bring your own database, bring your own Snowflake approach, where if they want our user experience, you know, they can provide their Snowflake instance, or their database instance, and then we can sit on top of it. So we built our architecture to be flexible enough to swap databases out and still work.

Kostas Pardalis 47:33
One last question from my side, and then I’ll give the stage back to Eric. Is the cost of Snowflake a concern so far?

Jimmy Chan 47:42
Snowflake is a bit on the pricier side. So Snowflake has some pretty good characteristics, right? In terms of separation of compute and storage, it’s really nice, because if we have people importing a lot of data but not processing it, we don’t have to have a huge machine that’s constantly hosting that data in that database just so we can store the data, right? So that was a pretty nice, you know, architecture for Snowflake that makes it nice for us. Pricing is more expensive. But we are able to actually make it cheaper for our customers, because we are the Snowflake customer. We basically have the accounts with Snowflake, right? And so I don’t know if you knew this, but Snowflake has, I think it’s by-the-second billing, but with a minimum of a minute for their compute. And so the devil is in the details of that, right? In theory, it’s really nice if you can charge by the second. But if you have a small customer, like an SMB, that’s running a tiny little query that’s going to take, what, a second or two at most, right? And so for the other 58 seconds, that’s a cost that I guess we have to cover, unless we have enough scale that at any point in time there’s always queries running. And then we can get some economies of scale with Snowflake. So for the purposes of building a product on Snowflake, I think with no scale, it becomes a lot more expensive for you and potentially for your end customer. But at scale, I think it’s going to be pretty neat. I think at scale, the math does work out pretty nicely. And with the other properties, being able to program that database through their APIs and having storage and compute separated, those are really, really nice properties. And obviously the security aspects of it. A lot of it comes built in.
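
The billing arithmetic Jimmy walks through can be written out as a tiny Python sketch. It assumes per-second billing with a 60-second minimum each time compute starts, as described above, and it ignores actual prices and warehouse sizes. The function names are made up for the example.

```python
MINIMUM_BILLED_SECONDS = 60  # per-second billing with a one-minute minimum


def billed_seconds(runtime_seconds):
    """Seconds charged for one continuous stretch of compute."""
    return max(MINIMUM_BILLED_SECONDS, runtime_seconds)


def utilisation_isolated(runtimes):
    """Share of billed time doing real work if every query starts compute
    on its own (the low-scale case)."""
    billed = sum(billed_seconds(r) for r in runtimes)
    return sum(runtimes) / billed


def utilisation_back_to_back(runtimes):
    """Share of billed time doing real work if the queries run back to back
    in one continuous stretch (the at-scale case, simplified)."""
    return sum(runtimes) / billed_seconds(sum(runtimes))
```

For thirty 2-second queries, the isolated case pays for 1,800 seconds to do 60 seconds of work, while the back-to-back case pays for exactly the 60 seconds used. That gap is the economy of scale Jimmy is pointing at.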

Eric Dodds 49:37
Well, we have time for just two more questions. One may be more complicated, the other’s not. And I’m going to put my business hat on here, because Kostas and I both, as former entrepreneurs, can’t help it, especially when it comes to data. But building on Snowflake, to me, especially with the SMB and enterprise approach, is brilliant. Because SMBs are going to move towards a database like Snowflake, right, as they sort of grow and expand and, you know, want to do more stuff. And enterprises that rely on a lot of offline data are also going to move towards Snowflake, you know, as they want to modernize their warehousing solution. And so the transferability, to me, especially when we think about Snowflake’s ability to sort of syndicate data, is really interesting. Did you think about that as you were building Dropbase? I mean, it’s amazing to think about, like, okay, I’m going to adopt Snowflake and, great, it’s just going to work, you know, as an SMB, and then the same thing from the enterprise side.

Jimmy Chan 50:44
Yeah, I’d like to say that we had all that foresight, but the reality is we are making a bet, basically. And our bet is that the smaller companies, the not-so-techy companies today, will leapfrog a basic database and go for the warehouse. And I think that data warehouse companies like Snowflake and others are also going to want to sort of, you know, tackle that market. So they’ll be building new features, and I think they’ll be working on their, you know, pricing structure, the business model. Eventually, I think it makes sense for every business to have some sort of database, right? And I think for us, it’s more like, okay, well, that end user might not know how to make that choice of a database. So why not give them something that is scalable from day one, that is highly secure, and something where we have a very high level of customizability, in terms of how we can program it and develop features around it? And so as they grow, they can still stay within it. There’s no need to migrate from databases to something else. But that sort of, I think, came afterwards. It wasn’t like we had this from the beginning. But we were making a bet that people will leapfrog, and they’ll want databases that work for them. Yeah. And so that was our starting point.

Eric Dodds 52:05
Yeah, super interesting. I mean, in many ways, you know, you think about the database as a core piece of infrastructure, and then the question kind of becomes interface, right? How do you interface with it? And different teams are going to interface in different ways, which is fascinating. Okay, final question. Dropbase: if listeners want to check it out or try it out, where do they go?

Jimmy Chan 52:26
Yeah, thank you. So they just go to dropbase.io and sign up for a free trial. And they can start using it right away by creating a workspace. They have a database from day one, a database that they can connect to, you know, Tableau, Looker, Mode, Retool, really anything they want to, because they have credentials to access that database.

Eric Dodds 52:47
Awesome. Well, Jimmy, this has been a fascinating conversation. I mean, we’ve talked about things that we haven’t talked about, you know, in over 70 episodes on the show. So thank you for that. A lot of fun. Thank you guys. Yeah, it’s great. So thanks for joining us, and best of luck with Dropbase.

Jimmy Chan 53:04
Thanks, Eric. And thanks, Kostas.

Eric Dodds 53:07
Okay, my takeaway, which may be obvious: we just don’t talk about offline data that much. And my guess would be, and I actually wish I could go back and ask Jimmy what his gut sense of this is, but the amount of data that still gets shared via spreadsheets via email has to be enormous. I mean, that has to be an enormous part of the market. And obviously, there are dynamics that are changing that. But it really is crazy to think about. I mean, you know, we just don’t talk about that a ton on the show. But there are people whose jobs revolve around sharing data in, you know, flat files or Excel files over email. And that’s probably bigger than, you know, sort of the slice of the world that we see that’s, you know, trying to do advanced BI with, you know, open source visualization or whatever.

Kostas Pardalis 54:00
100%. And I think it’s probably something very common in commerce, especially now that pretty much, you know, every company also has some kind of digital presence. But we have to remember that it’s not just the digital, there’s also physical presence, right? So there are still people going into stores and buying stuff. And yeah, there is an employee there at the end of the day who is probably, you know, completing an Excel document and sending that back to the headquarters or whatever. And that’s a lot of data, and important data, actually. So there is Shopify, but not everything is on Shopify, right? Yeah, it was very, very interesting to chat with Jimmy. I’ll keep two things from the conversation. One has to do with how people that are working, you know, let’s say at the edge of technology, keep forgetting that technology has existed for, I don’t know, 50 years now. So there’s a lot of legacy systems out there that we need to keep supporting, and we just forget about that. The other thing is how creative humans can be, right? So you see that email becomes a transportation layer for data. And I said two, but there is also a third there. I think we got a glimpse today of the rise of the data platform, right? That’s something that I have a feeling we will see more and more in the future, and we will see more products that are being built on top of not just AWS, but on top of Snowflake or on top of Databricks. So these are the three things that I keep from this conversation.

Eric Dodds 55:52
I agree. That was a small part of the conversation, but I think that was probably one of the most interesting points. That’s the first company we’ve talked to who is building on the Snowflake infrastructure, I think. Oh, that’s right. That’s right. Yeah. And that’s fascinating. Especially to hear about sort of the cost arbitrage, if you want to call it that, and the way the economics work.

Kostas Pardalis 56:19
Yeah. And there seems to be also a lot of effort from Snowflake itself to boost that. So anyone out there who’s considering building a product on top of data, maybe they should reach out and see what this program is.

Eric Dodds 56:32
Yeah, yeah, it’s embarrassing. I mean, I’m gonna look it up just because I’m interested. All right. Well, thank you for joining us. Of course, subscribe if you haven’t. Great episodes coming up, and we’ll catch you on the next one. We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
