Episode 63:

The ETL – ELT Flip With Ciaran Dynes of Matillion

November 24, 2021

On this week’s episode of The Data Stack Show, Eric and Kostas have a conversation with Ciaran Dynes, the Chief Product Officer at Matillion, a powerful and easy-to-use, completely cloud-capable ETL/ELT solution.

Notes:

Transcription:

Automated Transcription – May contain errors

Eric Dodds 0:06
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at Rudderstack.com. Welcome back to The Data Stack Show. We are really excited to talk to Ciaran Dynes from Matillion. He leads up product there, and he has a really long history of working in data. Kostas, I’m really interested to ask him about Matillion specifically, and we’ll probably talk about lots of things related to data in general. But there are a lot of ETL, or, as we’ll talk about, ELT tools out there, and I’m really interested to know how Matillion does things differently. I mean, they’re a really successful company that raised a huge round. And so I’m excited just to learn more about them.

Kostas Pardalis 1:05
Yeah, absolutely. I think they’ve raised something like a quarter of a billion so far, and they’re one of the leaders in this ELT space. So I think it’s going to be very interesting to hear from him. First of all, we’ll chat with him about ETL versus ELT, right? That’s one of the things that we need to ask him about. And yeah, I mean, Matillion has great exposure to so many companies out there. So I’m sure he will have some great insights to share with us about where the industry is going, what companies are looking for, and how the data is used. And yeah, I think we are going to enjoy the conversation today.

Eric Dodds 1:52
Alright, let’s dive in. Ciaran, thank you so much for joining us on The Data Stack Show. We’re really excited to learn about you, your background, and what you’re doing at Matillion.

Ciaran Dynes 2:03
Hey, thanks for having me, Eric. Nice to see you.

Eric Dodds 2:05
Alright, so you’ve been working in the data space for well over a decade. Do you want to give us just a quick background on where you came from and what you’ve done throughout your career?

Ciaran Dynes 2:15
Yeah, happy to give a quick intro. I’ve always been involved in integration software. Back in the day, I started with a software company in Ireland that was very much about integrating different applications. If your listeners know about object request brokers, or ORBs, they were kind of the precursor to web services. Then I worked my career up in web services and enterprise service bus, mostly on the messaging side: how applications and processes got integrated in BPM, business process management. Along the way I ended up doing a lot of work on APIs. Then a few friends of mine joined a software company called Talend, and they invited me to join, and it was a bit of a breath of fresh air. I always found it kind of strange to explain what an API is or what an ESB is to friends and family, who were probably bored senseless listening to me talk about it. But I found data so much easier to explain, because you could just describe any kind of interesting analytics project, and there are so many of them. I then worked my way along with Talend; we went IPO. More recently I’ve joined Matillion, very much looking at how analytics, cloud analytics and databases actually behave in the cloud. But yeah, I’ve always been involved in integration software. When data integration came along, it certainly lowered the barrier for me to explain to friends and family what I do and make it mildly interesting, purely because I think people are actually interested in some of the big data projects that we operate on.

Eric Dodds 3:44
Yeah, no, I’m laughing because, working in data, and I’m sure Kostas has had the same experience, you’re at a holiday party with family and someone asks, “Now, what does your company do?” And you pause for a minute to try to think about, okay, how do I package this in a way that’s digestible? So could you just quickly explain what Talend does and what Matillion does, just in case any of our listeners aren’t familiar with either of those? I think most of them are, but just to set the table for the conversation, that would be great.

Ciaran Dynes 4:16
Yeah, so the area that Matillion operates in is data analytics. I think a lot of people are familiar with data integration as a general term. But data integration means a lot of different things. It can be anything from data loading: people like Fivetran, Matillion, Stitch Data, Talend and Informatica do those things, and there are a bunch of open source projects out there as well that do the simple act of loading the data into, you know, a data lake or an S3 bucket or blob storage. That’s certainly one aspect of what we do in data integration. But it starts to get a little bit more than that. A lot of what we at Matillion really focus in on is how data behaves within data analytics. Data warehousing was very much about data in a data warehouse: How do you merge? How do you curate? How do you get a 360 view of a given data asset? And a lot of that information then ends up in Tableau reports and Qlik reports; it’s very much about BI and analytics. But data integration itself is a bit broader. There’s also streaming and those kinds of areas that in some respects have little or nothing to do with analytics. They can simply be about, you know, moving data from one application to another, or maybe even just moving the data back again. You can imagine your ERP: every time a new customer makes a purchase on a website, the ERP is then responsible for changing the inventory, doing the order, or doing the whole cash flow process. It’s not really analytics, per se, but it certainly has a lot to do with data integration. So it’s a pretty broad, all-encompassing term. Most of what we focus on within Matillion, though, is really about making analytics-ready data so people can actually do the Tableau thing, do the Qlik thing, build some reports.
But then, going beyond the report, it’s about whether we can take a 360 view of a customer, patient, or employee, and start to connect that back into operational systems, be they applications, customer experiences on websites, or operational databases, and just the scale of business. So a lot of different things, pretty broad, but as I say, most of what we focus on is in the area of analytics.

Eric Dodds 6:25
Very cool. And I want to start out with a question which has been an interesting topic in the data space in general, and one where I think, and I hope, that you and Matillion have strong opinions. So, ELT versus ETL. There’s a spectrum of opinions on this, and I think some strong opinions on which one is better, and that varies by use case. But what’s your take? And how does that look from an actual product perspective at Matillion?

Ciaran Dynes 7:01
Yeah, I think I had a phrase recently, that ETL should never have existed. Simply said, that’s a pretty strong opinion for a company that does ETL. And yeah, it probably is a strong opinion for somebody who does ETL. But the question is, why did ETL get created? It is a process, after all. It’s just a way of taking data from a number of different source systems, merging it together and making a table. That’s literally all it does. But with ETL versus ELT, you’ve got to look at how data warehouses were used back in the day. If you go back pre-Snowflake, pre-big data, let’s say go back 10 years, they were these kind of precious systems, and people had a fear of the admin who owned it. God forbid you went and asked them to run something ad hoc. I don’t even know if “ad hoc” is a word you can use with a Teradata admin. It’s just like, “What are you referring to? Go back to where you came from.” So it was very much about business-critical, financially critical workloads, which makes a lot of sense, right? You’re paying a lot of money for some very, very highly optimized, amazing software. Therefore the ad hoc kind of analytics that we’d run today, or even just the scale of the analytics, would have broken the bank. You wouldn’t have been able to fund an analytics project. So in that light, I think, ETL got created. Sure, it makes the data warehouse run faster, in the sense that we can extract data, load data, curate data. But actually, it took a lot of the processing, the more ad hoc analytics processing, outside of the data warehouse.
And therefore you end up with these dual parallel systems: your most important analytics happening in the data warehouse, and everything else, whatever it is, outside in this ETL process with its own specialized software and, in some ways, its own engines. Horizontally scalable engines, clustering, all that sort of stuff existed in ETL products. Whereas we fast forward to today, and you look at Snowflake, Databricks, Redshift, any of those, and they don’t even describe themselves as data warehouses anymore. They’ll describe themselves as a cloud data platform and all that kind of stuff. But when you peel it back, you see that, as a utility, the barrier to go and run a process, ad hoc or otherwise, even the most important one, is like $2 a credit. So you can go and, as a team, just start to do your own analytics, purely in isolation, maybe, from centralized IT, and get on with it. And in that world, you go and say, well, where does the processing now belong? It’s like, well, this utility, this Snowflake, this incredible, linearly scalable capability I’ve got, with all of my data at my fingertips: surely the better thing to do would be to leverage it and not have a separate parallel system. So, why did I say that ETL shouldn’t exist? It does exist for certain types of use cases, but the balance of processing has shifted. Whereas previously it may have been 80/20, where you had a lot of processing in ETL and a certain portion of high-value stuff in data warehousing, I think that’s reversed completely now when it comes to data analytics. When I’m thinking about the cleaning of the data and the preparation of the data, it all just lives and belongs inside the data warehouse. It’s faster, cheaper, better, more secure. It’s kind of like the Olympics.
Therefore, I think the real architecture design pattern for a lot of what we do when it comes to cloud data warehousing is that it just belongs inside the hyperscalers, and therefore you should just use them. Where ETL still makes a little bit of sense is the loading or the extraction. There are some peripheral use cases where it makes a lot of sense not to do the work in a data warehouse. Back to our original definition of data integration, I think those things are either on the load, or on the extract, or when you’re doing app-to-app kind of data stuff. What we use ETL for is always about pushing down to data warehousing; I think that belongs in the data warehouse. And that’s why I think there is a fundamental shift that’s happened, where people really are now using an ELT architecture. Maybe some people don’t even recognize it as such and go, “No, it’s ETL.” But that’s the product category. I think the architecture is really an ELT architecture. So, no strong opinions at all at Matillion.

Eric Dodds 11:27
Not in the least.
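
The ELT pattern Ciaran describes above (land the raw data first, then clean and prepare it inside the warehouse) can be sketched in a few lines. This is only an illustration, not how Matillion works: it uses Python’s built-in `sqlite3` as a stand-in for a cloud data warehouse, and the table names, column names and the `run_elt` function are hypothetical.

```python
import sqlite3


def run_elt(raw_orders):
    """Load raw rows untouched, then transform them with SQL inside the engine."""
    conn = sqlite3.connect(":memory:")  # stand-in for a cloud data warehouse

    # Extract + Load: land the source rows as-is in a staging table.
    conn.execute("CREATE TABLE raw_orders (customer TEXT, amount TEXT, country TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_orders)

    # Transform: cleaning and aggregation are pushed down as SQL,
    # so no separate processing engine sits outside the database.
    conn.execute(
        """
        CREATE TABLE customer_totals AS
        SELECT LOWER(TRIM(customer)) AS customer,
               UPPER(country)        AS country,
               SUM(CAST(amount AS REAL)) AS total
        FROM raw_orders
        WHERE amount IS NOT NULL AND TRIM(amount) <> ''
        GROUP BY LOWER(TRIM(customer)), UPPER(country)
        """
    )
    rows = conn.execute(
        "SELECT customer, country, total FROM customer_totals ORDER BY customer"
    ).fetchall()
    conn.close()
    return rows


raw = [(" Alice ", "10.5", "us"), ("alice", "4.5", "US"), ("Bob", "", "ie"), ("bob", "3", "IE")]
print(run_elt(raw))
```

In an ETL version of the same job, the trimming, casting and summing would happen in application code before the insert; here the only thing outside the database is the load itself.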

Kostas Pardalis 11:29
Yeah. Do you see any kind of current or future use cases where ETL might still be relevant?

Ciaran Dynes 11:39
I think the one that we see is certainly ingestion, although some companies would describe it as just ingestion. The act of having a SaaS application that just does the load makes sense there, right? Because it’s highly optimized. You can do streaming, you can do a whole bunch of other types of use cases. And the data warehouses, well, either they’re highly protected and you don’t want them connected to the internet that way, or they’re not yet optimized for that use case. But I can see Snowflake and others heading in that direction, where they’re adding more streaming ingestion capabilities. But simply the act of loading, ingestion, I think is kind of ETL-like. The other part that is interesting is the last mile of analytics, where you have something highly curated, a 360 view of a customer, and you want to, you know, synchronize that back into an operational application or operational database. I think that is ETL as well. It’s a separate process that sits outside the data warehouse. Data warehouses themselves are not optimized yet to run a lot of services. Even with some of the work that some of the data warehousing companies are doing today, it tends to be, how would I say, kind of limited in terms of some of the things that it can do. They’re not fully formed services in the way a SOA architecture would think about a service, or even a container or microservice. They tend to be extremely tightly bound to do functional things, where state and history and other things don’t apply. So I think at the edges it makes a lot of sense, IoT for example. But again, are we still doing ETL at that point, or is it more like a streaming use case? Is it Kafka, is it Confluent? I think there’s other technology out there that does those really effectively. But I think ELT as an architecture makes the most sense today in terms of how people are using data in the enterprise.

Kostas Pardalis 13:36
Cool. So if I’m thinking about how someone is doing ETL with something like Spark, for example, right, you have the extraction part, and you will write some code for the transformation. So in transit, the data is going to get transformed, and then it’s going to get loaded on the destination, which in our case, let’s say, is a data warehouse or a data lake. This transformation part, which is naturally a piece of code that someone writes, how does it happen in ELT? Because this part is pushed to the data warehouse, and the data warehouse is a technology that is primarily developed to ask questions and get replies to those questions, right? So how do you see this implemented? And how is Matillion doing it, if there are multiple flavors out there of how to do that?

Ciaran Dynes 14:25
Yeah, it’s a very interesting question. We’ve spent a lot of time working with Databricks on their architecture, and a lot of my background is with Spark technology from my previous employer. Arguably, what they do is that separation of compute and storage. Their compute and their separation of storage is not in any way dissimilar to what a cloud data platform does. It’s just different technologies, and they have different smarts and different schedulers and different histories. But what they’re both doing is they’ve separated the storage, they have a way of clustering the compute, there’s a scheduler, they break the task down, and they do a whole MapReduce kind of behavior. I’ve looked at that long and hard. The fact that one uses Spark, one uses Python, one uses SQL: that’s the modern architecture in my mind. That is the ELT. I think the ELT of yesteryear is synonymous with, you know, SQL only, and only working with cloud data warehousing, or even just data warehousing. But if you look at the lakehouse architecture from Databricks and the way their SQL analytics platform behaves, I mean, you can push SQL, Python or PySpark into that engine. It looks after the scheduling and the splitting of the tasks. To all intents and purposes, it’s still an ELT architecture in lots of ways, in that there’s logic that’s sitting directly on top of that data, virtualized. If I was to take the exact same problem and move it over to Snowflake, I’d probably get the same behavior. It might use different technology, it might be SQL based, but it’s pretty much the same thing: you’ve got access to all the data storage, and you can spin up the compute as you need it. It’s not a completely separate thing in that respect. I think that’s how we would consider it, and we, as Matillion, just generate SQL for Databricks.
It’s totally optimized for their platform. If we take the same design and we shoot it over to Redshift or Snowflake, internally we will just generate different SQL to leverage that platform, because they have some specializations and variants between each of them. But to you as an end user, you just see a design. Under the covers, we are leveraging the ELT architecture. Maybe I couldn’t convince Databricks to call what they do an ELT architecture, but at the end of the day, it’s that separation of compute and storage. That, I think, is the modern data architecture that people are looking to leverage. And the fact that the storage is literally just infinitely scalable, and so ubiquitous that you can just create materialized views of the data and use them for multiple different things, I think that’s the big game changer that we are witnessing.
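
The idea of one design rendered as different SQL per platform can be sketched as follows. This is a toy illustration under stated assumptions, not Matillion’s generator: the `Design` class and `render_sql` function are hypothetical, and the only dialect difference modeled is identifier quoting (Spark SQL on Databricks quotes identifiers with backticks, while Snowflake and Redshift use double quotes).

```python
from dataclasses import dataclass
from typing import List

# Identifier quote character per target platform.
IDENTIFIER_QUOTE = {"databricks": "`", "snowflake": '"', "redshift": '"'}


@dataclass
class Design:
    """Hypothetical stand-in for a visual design: pick columns from one table."""
    table: str
    columns: List[str]
    row_limit: int


def render_sql(design: Design, dialect: str) -> str:
    """Render the same logical design as SQL text for the chosen platform."""
    q = IDENTIFIER_QUOTE[dialect]
    cols = ", ".join(f"{q}{c}{q}" for c in design.columns)
    return f"SELECT {cols} FROM {q}{design.table}{q} LIMIT {design.row_limit}"


design = Design(table="sf_accounts", columns=["account id", "close date"], row_limit=200)
for target in ("databricks", "snowflake", "redshift"):
    print(target, "->", render_sql(design, target))
```

A real generator would also have to account for differences in functions, data types and optimizer hints between platforms; the point here is only that the design stays fixed while the emitted SQL varies.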

Kostas Pardalis 17:13
And how does it work with Matillion? What’s the experience that someone has when using Matillion?

Ciaran Dynes 17:20
So our experience, I guess, borrows a lot from a no-code, low-code IDE: drag and drop, where you are designing a logical flow. So you take a data set, and you tend to get a table view of your data; it tends to try to flatten everything into a table. We think that tables are, I guess, easier for most human beings to mentally construct, and we’re dealing with analytics people. So ultimately, something that makes it into a table is easy to store, easy to filter, and easy to pivot. That’s the moral of the story. But actually, under the covers, it isn’t all normalized, it isn’t all flattened. It’s a highly structured internal data model that we have; the visual cue that you see on top of that is just there to make it easy for you to use. But it is very much a drag-and-drop metaphor. It has a lot of the if-then-else logic that you typically see in ETL-style products. And from that, we create a visual, logical documentation of the analytics that the developer is trying to come up with. And then when they go to run the product, I think this is where the real smarts in Matillion kick in: we start to do a lot of live sampling with the data. So you construct a piece of logic; under the covers, we’re creating SQL, interacting directly, live, with Snowflake, we’re validating that the SQL is valid, and then we’re producing a sample data set. So you can actually see, up to that part of your design, whether your structure or your flow is like, okay, up until now, I’ve got my Salesforce data looking kind of correct. Maybe I’ve normalized the US and the European dates, because it tends to be the case that in Salesforce you run into that issue quite a lot. Okay, great. What’s the next thing I want to do? I want to bring in my Pardot data, and I want to merge those based on a particular primary key.
And the visual cue helps you continuously just iterate, iterate, iterate. By the time you get to deploying that data into Snowflake, or whatever your data warehouse of choice is, you’re pretty much certain that the table structure and the logic are correct. The only thing that potentially goes wrong is that, as you went through the sampling, you didn’t realize that the sample set wasn’t representative of the total underlying data set. That can happen: you tend to only see a couple of hundred rows, but maybe the underlying data set is far larger. And that’s why, when you flush all this through into the data warehouse, you can then go and visually check to see if you’ve actually corrected all the errors in the data. But it’s very much a visual metaphor. We try to get as much as we can to be no code or low code, but we have a million extension points where people want to plug in SQL or things like Python. You can even plug in R code, and things like dbt are all fair game for us. You can plug in those capabilities, and we simply orchestrate across all of them. But we’re trying to get a visual document representation of your analytics. When I last checked with a couple of customers last week, seven different data sets is kind of the norm for anything that’s considered what we’d call an insight. But we’ve got customers using 26 or 27 different data sources to produce a marketing lead score. You can try to hand code that, and you can, right? It’s just the maintenance, iteration and upgradability of that flow. That’s where we think the visual look and feel of the product really starts to come into its own, as well as that sampling capability, which we think is really just a killer capability: as you design, you see live data, and you see the logic of what you’ve designed. It’s those things that are the powerful capabilities that Matillion offers. Again, it’s all an ELT architecture.
So we’re directly operating on top of your data warehouse.
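
The sampling caveat mentioned above (a check that passes on a couple of hundred rows can still miss problems in the full data set) is easy to reproduce. The sketch below is hypothetical and not Matillion’s sampling logic: `parse_us_date` and `failing_rows` are illustrative names, and the data is made up to show a European-format date hiding outside the sample.

```python
from datetime import datetime


def parse_us_date(text):
    """Parse a US-style MM/DD/YYYY date string; raises ValueError otherwise."""
    return datetime.strptime(text, "%m/%d/%Y").date()


def failing_rows(rows, limit=None):
    """Return the rows that fail to parse, checking only the first `limit` rows if given."""
    subset = rows if limit is None else rows[:limit]
    bad = []
    for row in subset:
        try:
            parse_us_date(row)
        except ValueError:
            bad.append(row)
    return bad


dates = ["01/15/2021", "03/02/2021", "12/31/2020", "31/12/2020"]
print(failing_rows(dates, limit=3))  # the sample looks clean
print(failing_rows(dates))           # the full set exposes the European-format date
```

This is why a second check after the full load is still worthwhile even when every sampled step looked correct during design.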

Kostas Pardalis 21:06
Yeah, that’s very interesting. And who is the user of Matillion?

Ciaran Dynes 21:11
The user for us is a data engineer, but the problem with that term is that it means a lot of different things. So, ETL engineer, 100%. That’s the person who’s used to the ETL design paradigm. Data engineering is a broader term. A data engineer for us is anybody that could be doing things like Airflow orchestration. Yeah, they could be hand coding, but if you look at it long and hard enough, it’s just a different tool set or a different stack for them. So we try to blend both of those in, with people who are more used to that kind of engineering background, which is CI/CD and inversion of control. Hey, they probably grew up writing Java code using the Spring Framework, for all I know, but that’s just me. But that type of person is now coming into the data world. The reason, I think, is that people are recognizing that the resilience of the data pipeline matters. It’s a word you hear a lot, like the downtime of your data. And I think engineers have been really good, certainly SRE and cloud ops engineers have been really good, in terms of figuring that problem out. And I think that DataOps thing is strongly influencing, or has influenced, Matillion in terms of how we look at orchestrating those pipelines together. So we have these ETL people, and they’re very much looking at business logic and business data. Their job is to take what, you know, your CFO wants to see in terms of revenue forecasting and that type of thing. But there’s a whole bunch of other people around that who are building on the periphery: the connecting, and the loading of the data into the bronze sort of storage. That engineer is also part of what we do. They’re different skill sets, but they’re complementary in nature. I think we tend to separate them: there’s almost like a mini SRE team, which is the data engineers that surround these ETL engineers.
The ETL engineers are really looking at the actual design logic, the creation of this master record of something. Those tend to be the two groups that use us.

Kostas Pardalis 23:10
Yeah, makes total sense. It’s a great point that you’re making here, because many times you hear people asking, “Okay, what is a data engineer? Why do we need another discipline in engineering?” And actually, I think that the best definition that I can personally give is that data engineering is a hybrid between operations, SREs, as you said, and actual software engineering, because you also have to do that. So to be a successful data engineer, you pretty much need to have knowledge from both. You need to build your pipelines, but at the same time you then have to monitor your pipelines and care about delays and keep them up and running and all these things. And I have a question, which is actually something that I find interesting in general, not just for data products: how does this visual metaphor that you described fit into the workflows that engineers and developers have? All the CI/CD, versioning, all the tools and methodologies that engineers have to support the quality of their work. How does it work?

Ciaran Dynes 24:23
Very interesting question. I spent a lot of years looking at CI/CD and version control. I couldn’t resist it, actually. If I go back 18 years, I was at one point a ClearCase admin. I spent a bunch of time being an engineering manager, and I had to be the ClearCase admin because there was just nobody else to do it. So I kind of grew up in that whole strong version control world that the IBM Rational products had, and then others have products that have come along these days. Everybody uses Git or Bitbucket, those types of things. But the whole notion of versioning and branching and merging and those types of capabilities, it’s not that it’s not natural, but it’s not in the purview, I think, of the ETL engineer. It certainly hasn’t been. But it’s definitely something that engineers are just going to go, “Well, that’s how you do it.” So what we’ve tended to see is that the Git-like capabilities, with version control, branching and merging, are becoming commonplace in the data products, the ETL stack. We may not use the same labeling, and the way it’s visually shown to the end user may not be the way an engineer would be comfortable with, but it’s the same thing. And actually, under the covers, we’re using Git, for that matter. That’s how we do our version control. And it’s very strong version control, and very strong branching and merging. But I haven’t yet exposed that terminology to the ETL engineer. I don’t want to scare them. But I think they like the fact that they can roll back, they can share, they can do all those things. And then it goes further than that, right?
Because non-repudiation of a version is becoming a really important thing in our world. Because we operate so quickly, at some point something breaks. Now, breaking could be just that a pipeline doesn’t run. It could be a security issue, it could be something else, it could be something more nefarious, right? A bunch of records appeared on the internet and, lo and behold, we didn’t mask something properly. Somebody’s got to go check out why. And maybe there was a misconfiguration of a rule inside one of the ETL pipelines. If one of those particular products is not versioned and controlled and checked in, you have no idea. And a lot of ETL down through the years is just not that. It’s almost like: we’ve got an analytics project. Great. How does the data work? It does this thing. Okay, how do we know we’re being successful? Because the head of sales hasn’t given out to me this week. That was the testing, right? And then you come along, and you upgrade or migrate that. So how do we retest? Well, we check to see if the head of sales is giving out to us again, and then we know that the report looks like it’s correct. But that’s clearly not good enough, I think, in modern enterprises. So I think CI/CD is here to stay. It’s just that we don’t necessarily expose those features the way we would to an engineer, but we’re still using them under the covers. So that’s how we experience it. But it is strong versioning, for a lot of good reasons. And a lot of it comes back to this: we think it simply makes the data boat go faster, because upgrades and migrations and all those things happen all the time now, and reuse is really well supported by those principles.

Kostas Pardalis 27:39
Yeah, that’s super interesting. And there are two terms that we hear a lot lately, and many companies are getting funding to build products around them, which is anything around data governance and data quality. What’s your opinion on these? And how do you see these kinds of functionality playing together with an ETL tool, or ELT tool, like Matillion?

Ciaran Dynes 28:05
Very interesting one. I’ve spent a lot of time over the last number of years building data cataloging technology and data governance technology. And I’ve seen it grow up, and then during COVID, I wouldn’t say it waned, but it has found maybe some of its place and position. So I think cataloging capabilities can really dramatically improve analytics; they really promote very strong reuse. If you can extract a lot of the semantic meaning of data, you can do really cool things. You can start to automatically infer if the data is good or bad, or of a standard or non-standard. And that stuff comes, I think, a lot from the principles of what data governance teams and products can do. They’re very good at looking at metadata, and they’re very good at looking at relationships. And if you can put that stuff to use, you can ultimately solve the big problem, which is the data quality problem. So I think a lot of what the governance products can do is provide really good semantic understanding of data that could be used not just for the purpose of governance, but, more importantly, for the purpose of data quality: fixing data, or automatically detecting and indicating that there’s something wrong with the data. A lot of the governance products, and we’re really good friends with Collibra, are in some ways at a different level. They’re kind of like a ticketing system with approvals, and there are data custodians, people who own the data, who have to approve or sanction it for use. But I think they’re only really at the beginning of that industry. I think, yeah, we’ve seen massive innovation there in the last couple of years.
But I think it’s going to be more interesting if you look at what Snowflake’s doing around the Data Cloud, this idea that there are these massively curated sets of reference data. It becomes really interesting when you start to apply some of the principles of the catalogs and the governance in terms of: where did that data go, and how does anybody know after it’s been released in the Data Cloud? So I think governance is interesting in that it has a whole new innovation area that I think it’ll eventually end up in. But primarily, right now, I’m fascinated by the use of the metadata that governance tools have to actually go and fix the quality problem. I think that is a problem we should go fix. It’s not even just practical; we have to solve that problem. And I think governance is kind of an interesting secondary issue that a lot of organizations have. But everybody has a data quality problem. Everybody has that problem. So I think for ETL, for us, we use the metadata to go fix it. And then we partner with the best in the business, the likes of Alation and Collibra, to help their customers do what they want and what they do in terms of approvals and all those types of things. But to me, I’m more interested in the use of the metadata to go fix the quality issue.

Kostas Pardalis 31:03
Mm hmm. Super interesting. Okay. So ETL, or ELT as we call it today, I mean, it’s something that has existed pretty much since we created databases, right? So we might keep reinventing it, but as a process it’s something that has existed, like, forever. What’s the future? What does it look like? How do you see it, based on your experience with Matillion? What is next, both for Matillion and also for this category of products?

Ciaran Dynes 31:31
I think you’re right. Every once in a while a blog will come out, usually by a data integration vendor, saying that ETL is dead, just to kind of reskin it and say it’s not like that, it takes on a new life of its own. What we look at right now is what we call a definition of modern analytics, which is a combination of BI, data science, and operational analytics. So in that respect, I think you’re right, ETL is here to stay. But I think the future of ETL is back to what we talked about in terms of the operations. It’s really about not just automating much more; it’s much more resilience in those pipelines. Like, how can you detect that something is going to fail before it fails? And we can really solve that problem today. How can you do things like get a job to optimize itself? Those types of things are definitely starting to become real quick things that we can actually go do, because we’ve learned a lot more about the relationships of the data, and the query optimizers inside the data warehouse are becoming a little bit more accessible in terms of how the APIs work. That’s where I think a lot of ETL has got to go: can we detect errors before they happen? Can we alert people? And then can we auto-detect that something could be better optimized by automatically tweaking the configuration? The only way we can do those things is that we’ve got APIs, we have the ability to inject variables. So again, good engineering principles. And then it’s actually about leveraging the APIs of all those underlying platforms, where they have really smart, intelligent things built in. And we can basically promote different attributes, different ways of configuring the optimizers. And those optimizers then help the actual job run better.
So it’s that kind of an ecosystem, or a sense of, if you can bring together all those kinds of capabilities, the ETL becomes smarter, more resilient, more optimized in the future. But I do think it comes back to this: we’re trying to solve the problem of BI, data science, and operational analytics. And ultimately, that’s going to be about making the pipelines run faster, with more resilience, and then using the data, curating it much more, and reusing the insight that we actually generate and curate. That’s what I think the future is. And that’s exactly what we’re building at Matillion. We call it the data operating system. We think that companies need to run their data as an operating system. And an operating system by definition is modular, smarter, more resilient, more scalable than the way we used to look at ETL. You know, that say last year that you’re performing?
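
[Editor’s note: The idea of detecting that a job is going to fail before it fails can be illustrated with a very small sketch. This is not Matillion’s method; it simply assumes a hypothetical job whose run durations are logged in seconds and which is killed at a fixed timeout. The function names, the `headroom` parameter, and the linear-trend approach are all assumptions made for illustration.]

```python
from statistics import mean


def predict_next(durations):
    """Fit a least-squares line to past run durations and extrapolate one run ahead."""
    n = len(durations)
    if n < 2:
        raise ValueError("need at least two past runs to fit a trend")
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(durations)
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, durations))
        / sum((x - x_bar) ** 2 for x in xs)
    )
    intercept = y_bar - slope * x_bar
    return intercept + slope * n


def at_risk(durations, timeout_s, headroom=0.9):
    """Flag the job when the predicted next run eats most of the timeout budget.

    `headroom` is a hypothetical tuning knob: 0.9 means "alert at 90% of the timeout".
    """
    return predict_next(durations) >= headroom * timeout_s


if __name__ == "__main__":
    recent = [100, 110, 120, 130]  # seconds; made-up sample data
    print(predict_next(recent), at_risk(recent, timeout_s=150))
```

[A real pipeline would likely look at more signals than duration, such as row counts or warehouse queue time, but the shape is the same: extrapolate from history, then alert before the limit is hit rather than after.]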

Kostas Pardalis 34:09
Nice. One last question from me, and then I’ll give the stage to Eric. Okay. About the destinations: I think the set of possible destinations is pretty limited. We know it’s all the data warehouses, and there are not that many anyway. But about the sources, based on your experience at Matillion with all the different companies that you have interacted with, what are the most, let’s say, common ones? And also, can we break them down into some categories of sources that are distinct in some way?

Ciaran Dynes 34:43
It’s a great question. I think it’s a real bugbear of all software vendors right now that everybody ends up basically becoming a connector company in the integration world. And a lot of it’s down to the fact that customers are not sure of its own willing to get where they want. It is Every connector has to be supported. So then everybody basically does the same thing over and over again; we all end up with hundreds of connectors. And then lo and behold, AWS will change a security profile, come out with some new IAM service, and you’ve got to go and iterate through 100 connectors. And we all do it, right? Every single one of us. It doesn’t matter, you’re going to rock up to your next big $1 million customer come January, and they’re like, hey, do you guys support some API from some new CRM that you haven’t heard of before? To break the back of that problem, we’ve been kind of looking at, hey, we’ll give you a no-code toolkit; you point it at the API, and we will automatically construct a Matillion connector under the covers, to try to alleviate some of that need for the vendor always to be building out the connector. Connectors for us largely fall into really just two very simple categories. At Matillion, we tend to broadly look at batch-oriented APIs, batch-oriented data warehousing, JDBC connectivity, but now increasingly we look a lot more at CDC and streaming APIs. So there’s a lot of work that we’ve been putting in, and we’re going to announce it at re:Invent in a couple of weeks, in the area of change data capture and streaming. We tend to look at those APIs subtly differently because of the nature of the queueing capabilities in the queueing technology. And there’s just a whole other kind of service lifecycle that you have to obey and observe.
That’s quite different with APIs in the sense of internet APIs, REST APIs, versus something like a queueing technology where you read it once, at-most-once delivery; all those types of things are very, very different. So I tend to look at those as the two broad categories. But ultimately, it comes back to, are there just vertical categories that customers are interested in? Do you have a set of capabilities in finance? Are you guys really good with billing applications? Like, do you support Recurly, and all the rest of it, the whole list of things like NetSuite? Yep. But I think for us, it really comes down to this: the ingestion capabilities are broadly bifurcated into REST APIs, databases, and increasingly now streaming APIs. Okay,

Kostas Pardalis 37:05
that’s interesting. And why do you, like, why is this important? I mean, recently also Fivetran, like, bought a company that specializes in CDC. We have seen CDC being mentioned a lot, especially in big corporations. We had someone from Netflix; they have done a lot of work there. Why is CDC, like, so important? Because initially, I mean, the technology is based on the replication logs of the databases, right? It was built for something completely different.

Ciaran Dynes 37:37
For every time I’ve heard ETL is dead, I’ve heard CDC is good. I think a lot of it has to do with the fact that organizations right now are doing cloud migration, and they’re trying to digitize as fast as they can. And at the end of the day, they don’t have the ability to always change all of their on-prem software at the same time. But they have the need to basically get that data, the changed data sets, into their cloud analytics platform. So a lot of it, for me, is that they’ve selected a cloud data warehouse, they’ve bought in very strongly to the vision of what that analytics can deliver. I mean, it’s true, right? I’ve seen it for myself, I can see what those platforms can deliver. But some of those changes in their business are so important, and they have to happen at a faster rate than basically a daily or an hourly batch load, that it’s like, if we could just use the CDC style of use case, that would basically help our analytics. And I think it’s that state of affairs that we’re in right now. I do believe, though, that there will be another messaging technology, it could be Kafka, it could be a new variant that comes along, that’s so ubiquitous and widely deployed within the cloud infrastructure, and that overcomes a bit of the complexity of the admin, that we could just see a replacement of some of that CDC style of use case, which is, as you said, the log, redo log kind of style, and it becomes much more of a messaging kind of push to a queue with topics and basically multiple readers. Right now, I think it’s just one of practicality. I think we’re used to basically doing the logging; we can’t change those operational databases, even if we wished, because it would just impact the business so catastrophically. They’re so big. It’d be just too risky. So why change what works, I think, is what I’m observing. Like one in every four of our customers right now is like, what are you guys doing with CDC? And how do you get it into Snowflake?
So it’s not just that; it’s, can you get it into Snowflake in a highly resilient way? And I actually think it may be, I guess, proven a lot by the likes of the Fivetran guys saying, hey, data ingestion is not just ETL. And I was like, well, no, it’s different, because what we’ve done is we’ve just said we’re going to solve the problem of loading data to cloud. And after that, do what you want with it. And I think CDC is set for a similar kind of rethink: just get it out of the log file and stick it into S3, and do what you want after that. We’ll have it ordered, we will have high fidelity, we’ll have a metadata log, we’ll have all of the information that you need to go and do whatever analytics you want, and as much of it as you want. And I think that’s the redo on CDC that’s coming. It’s optimized for the way we can do analytics in the cloud. I think that’s the evolution that’s coming rapidly in CDC.
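
[Editor’s note: The "get it out of the log, keep it ordered, do what you want after that" pattern can be sketched in a few lines. This is an illustration only, not any vendor’s format: the event fields (`lsn`, `op`, `key`, `row`) are hypothetical names, and the sketch assumes each change carries a monotonically increasing log sequence number so that a downstream reader can replay the changes in order and ignore duplicates.]

```python
def apply_changes(events):
    """Replay change events in log order to rebuild the current table state.

    Each event is a dict with hypothetical fields:
      lsn: int   position in the source log (assumed strictly increasing per change)
      op:  str   "insert", "update", or "delete"
      key: any   primary key of the affected row
      row: dict  new row image (absent for deletes)
    """
    state = {}
    last_lsn = {}
    # Events may land in object storage out of order, so sort by log position first.
    for ev in sorted(events, key=lambda e: e["lsn"]):
        key = ev["key"]
        # Skip anything already applied; this makes redelivered events harmless.
        if ev["lsn"] <= last_lsn.get(key, -1):
            continue
        if ev["op"] in ("insert", "update"):
            state[key] = ev["row"]
        elif ev["op"] == "delete":
            state.pop(key, None)
        else:
            raise ValueError(f"unknown op: {ev['op']!r}")
        last_lsn[key] = ev["lsn"]
    return state


if __name__ == "__main__":
    # Made-up sample: out of order, with one duplicate delivery (lsn 2).
    events = [
        {"lsn": 2, "op": "update", "key": 1, "row": {"name": "Ada", "plan": "pro"}},
        {"lsn": 1, "op": "insert", "key": 1, "row": {"name": "Ada", "plan": "free"}},
        {"lsn": 3, "op": "insert", "key": 2, "row": {"name": "Bob", "plan": "free"}},
        {"lsn": 2, "op": "update", "key": 1, "row": {"name": "Ada", "plan": "pro"}},
        {"lsn": 4, "op": "delete", "key": 2},
    ]
    print(apply_changes(events))
```

[Because the raw events are kept rather than only the final state, the same log could also be used to answer "what did this row look like last Tuesday", which is part of the appeal of landing the changes first and deciding on the analytics later.]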

Kostas Pardalis 40:38
Super interesting. Eric, all yours.

Eric Dodds 40:43
I know you keep going. I think we’re getting close to the end here. Ciaran, I wanted to rewind just a little bit. You talked about modern analytics as being BI, data science, and operational analytics. And I’d love to drill in on that. One thing that we’ve seen repeatedly on the show is that there are a lot of terms that I think a lot of people, including myself, think are easy to define. Oh, analytics, right? But then if someone said, hey, could you give me a really good, you know, concise, articulate definition of analytics, I may have to stop and think about that, because it can be very complex and wide-ranging. But it just really struck me that you sort of included three pretty traditionally separate components in a single definition of analytics. So can you dig into that?

Ciaran Dynes 41:36
Yeah, I think for me, if I go back to where we kind of started here, we talked about ELT and the benefits of cloud storage technology and separation of compute. Business intelligence, I’d love to know who came up with the term, by the way. I think it’s fascinating, right? Because I think middleware companies and integration companies always strive for, how do we tell people our business value? We never really cracked the code. But the analytics guys basically cracked it 20 years ago: it’s business intelligence. It just sounds amazing. And really, what a lot of it is, as you know, is providing reporting on data. There’s a lot of stuff that goes into making that happen. But when I look at data science, I don’t think it’s the same type of analytics in general. I think a lot of stuff we do in data science can be, but I think it’s sometimes observed that the answers that we get from data science are not always deterministic. That’s always the classic one. Sometimes they can be range-based, so within a particular range, the answer is somewhere here. And it’s a different type of thing. And you’ve got a whole bunch of techniques and algorithms and stuff that people have built up, so I won’t even go into all that. But I still think it’s analytics; it’s a different type that serves a different purpose. And whether you’re bought in on the volume and scale and all that type of stuff, yeah, maybe, but I think it’s more to do with the fact that the answer is not always a single deterministic value. It’s in a kind of a range of values. The last thing, on operational analytics: I think that’s different only because it distinctly says, we want to operationalize something that we’ve learned. And all it basically says is, hey, the last mile of analytics is not a visual dashboard. It certainly is a great way to create a conversation with an executive team.
But ultimately, I always talk about leading and lagging indicators. We’re big believers of this at Matillion. There’s a framework called the four disciplines of execution. And it’s like, you define a wildly important goal as a team, you set this notion of what’s a lagging indicator, which might be revenue or something like that, or customer count. But nobody in a sales team can do revenue on a Monday, because, hey, the deal might not even close for six months from now. But they can do how many customers they’ve talked to this week, how many demos they’ve done, how many trials they got, how many MQLs, how many SQLs they’ve become. Great, they’re leading indicators of revenue. So when we start breaking things down that kind of way, you start to get into this kind of, okay, the visual dashboards that we use in lots of organizations are really around those things. What do you do with them after you’ve learned some sort of an insight? You’ve learned that there’s a correlation between this and this. It makes so much more sense to take that insight and take it out of the Tableau dashboard and give it back to that salesperson who every day is making those cold calls. You’d say, hey, now for another list of 100 calls we’ve created for you as the marketing team, we’re gonna stack rank them in the best order we think is possible for you to call those customers, because we think there’s a propensity model here that you need to know about. The propensity model comes from maybe data science; the underlying data sets come from BI; BI basically creates a dashboard for me to go and say, oh, that looks interesting. But operational analytics is to take that insight and actually do something with it in the day-to-day of the salesperson. So that’s why I separate them into three things: because I think they deliver different value. I think they actually are subtly different use cases, even though they all can be additive to each other. And that’s what we define as modern analytics.
So when you’re doing at least all of those three, we think you have the right to basically say, hey, we’re a digital leader. And it’s Peet’s Coffee, it’s Slack, it’s the Juniper Networks. It’s those companies that work with us, and that’s what they’re doing. So I want to start off by saying, hey, these guys are using these output connectors from Matillion. Those are the companies that are driving us for those; they want those insights that they’ve generated from BI and data science. And they want to get those marketing lead score out of the developed back into the Layers of their marketeers. That’s why we call that modern analytics; we think that really defines a data-driven company.
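
[Editor’s note: The stack-ranked call list described above can be sketched as follows. This is a toy illustration, not a real propensity model: the feature names, weights, and bias are invented for the example, where in practice they would come from a model trained by a data science team on warehouse data. The point is only the last step, turning a score into an ordered list a salesperson can act on.]

```python
import math

# Hypothetical weights; a real model would learn these from historical outcomes.
WEIGHTS = {"demo_attended": 1.2, "trial_started": 1.5, "emails_opened": 0.1}
BIAS = -2.0


def propensity(lead):
    """Logistic score in (0, 1): higher means more likely to convert."""
    z = BIAS + sum(weight * lead.get(feature, 0) for feature, weight in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def stack_rank(leads):
    """Return leads ordered from highest to lowest propensity."""
    return sorted(leads, key=propensity, reverse=True)


if __name__ == "__main__":
    # Made-up leads for illustration.
    leads = [
        {"name": "B", "emails_opened": 2},
        {"name": "A", "demo_attended": 1, "trial_started": 1, "emails_opened": 5},
        {"name": "C", "demo_attended": 1},
    ]
    for lead in stack_rank(leads):
        print(lead["name"], round(propensity(lead), 3))
```

[In the terms used in the conversation, the inputs are the leading indicators tracked in BI, the scoring function stands in for the data science model, and pushing the ranked list back into the tool the salesperson already uses is the operational analytics step.]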

Eric Dodds 45:49
How many companies... so a lot of times I think about the journey that a company has to go on to become data driven, call it digital transformation, whatever, right? First, you need to collect all of your data and, to your point, fix your data quality issues, so that whatever insights, and however you’re deriving them, on top of all this collected data in your warehouse are good insights, number one. Dashboarding, I would say, is part of that, but a lot of times it’s sort of another step, right, actually building dashboards, and that goes from executives down to functional teams. Like, what is the fall-off of companies who collect data, do the quality thing, have good dashboards? And then how many companies are actually doing operational analytics well? Because my sense is that it’s probably not many. I mean, you mentioned companies that we all want to emulate. But I wonder what the penetration is with operational analytics?

Ciaran Dynes 46:46
We did a bunch of surveys actually this year on that. I’m happy to share the data with you, if you wish. We saw that BI analytics, yeah, no surprise, right, with the likes of Matillion, it’s in the 80th percentile, right? Sure. And I was surprised it wasn’t 100%; I was kind of scratching my head on that one. The data science guys were like 46, 52%, somewhere around there, because we ran this survey over multiple days with different webinars that we run. And the operational ones were between 8 and 14%. Right. So it’s down there. It’s not done as much as maybe the other types of analytics. I guess that’s to be expected; people are kind of maybe only now waking up to doing some of these things. I’ve got to believe that COVID has fundamentally changed the way we use data forever. I think a lot of what we’re doing right now in terms of digitizing the business is never going to change; we’re never going to go back to some of the things we used to do. We’ve had to basically put a lot more reporting in front of people in Zoom calls, in things that were in Slack and click, see, we’ve created the... We don’t print out stuff anymore, right? We don’t basically show up to the exec meeting and here’s the printed exec pack and the rest of it. The exec pack is basically a Google Doc, and the Google Doc links into the Tableau report and all those things. And then you start getting into, like, what’s a hack day in an organization? It’s basically tick. So the cultural shift is rapidly happening. And I think it’s like, hey, it’s winners and losers, right? A lot of businesses will struggle to make it through this pandemic. And the ones who basically come out the other side of it, I think they’re the ones who are going, hey, what is everybody else doing?
And I think if they’ve at least made the journey to cloud analytics, and they have a cloud platform, they have a fighting chance of taking some of those insights they probably have already created. They’ve just got to basically get them out of the data warehouse and put them into some operational system: database, e-commerce website, something. Propensity-to-buy model, what’s the next product that somebody wants to select, all those things. A few weeks ago, I met one of our customers on an online webinar. I kind of jokingly said to him, like, it wasn’t only consulting, but he’s insurance. I said, hey, insurance must be the most boring data industry to work in, right? I said, so what are you doing in data science? Probably nothing. Just to provoke a reaction. And he kind of laughed and said, well, let me tell you what we do. Because you may have heard that in America sometimes we struggle with things like climate change and agreeing whether it’s happening or not. I might have read that in the news, yet it’s controversial to kind of dig into that with a customer on a live webinar. But he said, we as an insurance company have to take a position on climate change, because we insure a lot of properties in the state of Washington. If you look at the state of Washington recently, a lot of forest fires and things like that. So they’re using a lot of GPS location data, weather reporting data, a lot of predictive models, and incorporating that into the contract insurance information that they have, to say, what is the likelihood of a forest fire wiping out town A, B, and C next year? Now that is climate change. But it’s really interesting that as an insurance company, they have to take a position on that, because that is basically the future, or not the future, of the insurance company, if they are not on the right side of that predictive model. Therein is the operational analytics that I’m talking about. They’re doing BI, they’ve moved to data science.
And now they’re basically building alerting into their entire system. Like, there you go: in the most boring industry, insurance, they’re doing incredible things. And then I thought it was just hilarious, big important for me. But I thought it was interesting that he chose to say that they had to take a position on climate change. So that’s the other thing I think is fascinating. During COVID, we are all more exposed to data and data analytics and projects like never before, right? You think about it: if you used to watch the news in America the last 20 years, what is on the news that is to do with data? It’s all financial, Wall Street and stuff, and baseball, right? Two things. So the only two things, and now we have COVID, right? COVID is the third thing that every day, it’s stats, reports, all those things. That’s really interesting to me, in terms of that our culture is becoming more data savvy, more analytics aware, beyond the two popular ones, as I said, of finance and sports. We’re now basically looking at another dimension that is far more scientific-related. But look at climate change reports and how they’re denied, and basically some people believe and some people don’t believe. Those things, I think, are going to generate a culture of people that have a greater, I hope, awareness of the importance of using data to prove or disprove a theory.

Eric Dodds 51:41
Hmm, well, we could not have picked a better way to end the show. I think that was incredibly insightful. I am with you; I hope that our societies do become more data driven and more analytical, because I think that’s really healthy in many facets of life, not just if you work in data integration. This has been a really great show. And we’d love to have you back on as you settle into the saddle even more at Matillion and continue to build some amazing things for your customers. Great.

Ciaran Dynes 52:15
Thanks for having me, guys. Really great to chat with you today and love to do it again.

Eric Dodds 52:20
Okay, that was a really interesting show. And this takeaway for me is more of just a funny one. It was an anecdotal observation that Ciaran made, but he talked about the head of sales giving you hell when something’s not working or you have a data quality issue. I think we’ve probably both experienced that throughout our careers in one way or another. And I just got a kick out of the idea that the head of sales is sort of the most high-impact data QA engineer that they’re

Kostas Pardalis 52:59
The question, why are these leads not in Salesforce, is something that, how many times do you hear that? Yeah, like, I don’t know, it’s like an amazing error detection mechanism for data quality issues,

Eric Dodds 53:21
I think needs a propensity.

Kostas Pardalis 53:26
Absolutely, yeah. If you have like 100 leads with that. Yeah, it was an amazing conversation. I mean, Ciaran is a person who has huge, huge experience. He has experienced ETL or ELT or data integration, whatever we want to call it, I mean, in different phases of the industry, with Talend, with Matillion now. And yeah, he shared some amazing thoughts and experiences with us. And I’m really looking forward to having him again, back to

Eric Dodds 53:56
the show. Well, thanks for joining us again, and we will catch you on the next episode of The Data Stack Show. We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at Rudderstack.com.

Transcribed by https://otter.ai
