Episode 43:

Doing MLOps on Top of Apache Pulsar and Trino with Joshua Odmark of Pandio

June 23, 2021

This week on The Data Stack Show, Eric and Kostas are joined by Joshua Odmark, the co-founder and CTO of Pandio. Pandio is built on Apache Pulsar and is designed to help companies achieve their AI and ML goals.

Notes:

Highlights from this week’s episode:

  • Joshua started his first company at age 15 and then sold two more startups after that (2:15)
  • Embracing the open source movement and not reinventing the wheel if you don’t have to (12:15)
  • Pulsar seemed built to address Kafka’s weaknesses (17:23)
  • Using Redis as a coordinator for federated learning and taking advantage of its portability (23:05)
  • The pillars of Pandio and some practical use cases (31:24)
  • Feature stores and model versioning (38:23)
  • Seeing Pulsar as the future because of the ability to run tens of millions of topics (41:04)

 

The Data Stack Show is a weekly podcast powered by RudderStack. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds  00:06

Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com.

 

Eric Dodds  00:26

Well, on today’s show, we get to discuss a topic that always brings a little bit of spice to the conversation. And that is Kafka and Kafka-related technologies. And to make it even more interesting, we’re going to talk with the founder and CTO of Pandio about its tooling built on top of Apache Pulsar. They do a lot of other things, and we’ll talk about ML orchestration and some other things they do. Kostas, I am just really interested. I always love a conversation where there’s very opinionated discussion around Kafka and things that compare with Kafka. So that’s what I want to hear. Maybe you’ll get to that question on the technical side, but I can’t wait to hear what someone building on Pulsar has to say about Kafka.

 

Kostas Pardalis  01:14

Yeah, absolutely. That’s also my burning question, to be honest. Pulsar is not a very new technology. But it’s gaining a lot of traction lately. And I’m very curious to learn why. What are the differences, and why Pulsar instead of Kafka? So that’s definitely going to be part of the conversation that we’ll have today with Josh.

 

Eric Dodds  01:36

Great, well, we’re going to talk with Josh, the CTO and founder of Pandio.

 

Eric Dodds  01:42

Josh, welcome to the show. You have a long history working with data, and you’re doing some really interesting things at Pandio. And we want to talk about all sorts of things, including machine learning orchestration and all that. But before we get going, could you just give us your background? You’ve done a lot of different things, but we’d love to hear about your journey working at data-related companies, and maybe just provide a little bit of perspective on how things have changed over, you know, such a long career with data.

 

Joshua Odmark  02:15

Thanks so much for having me. I really appreciate it. So yeah, my name is Joshua Odmark. I’m currently the founder and CTO of Pandio. We consider ourselves an AI orchestration company. So we help companies operationalize AI at the end of the day. I started my career incredibly early, and this usually shocks people, but I started my first company when I was 15 years old. Now, it’s not as exciting as that almost seems. I mainly started it because I did not like working your traditional jobs. So basically, what I did was just repair computers. So as my friends went to work at Burger King, or, you know, washing dishes at a local restaurant, I basically earned the same amount of money, but I set my own schedule. And that kind of set the bug early of being a serial entrepreneur. And then shortly after that is where my data journey begins. And it got very interesting. So as a senior in high school, still not even out of high school yet, I started a company in the early 2000s with somebody that I’d only met online. I only knew them by an online handle.

 

Eric Dodds  03:25

Really quickly, what platform did you meet them on? Just because, you know, we’re so used to that being a common thing now. I mean, you know, first, like, digital dating and other things like that. But in the 90s, you know, there weren’t that many venues actually.

 

Joshua Odmark  03:39

Yeah, yes, it was, it was interesting. So ICQ was the main messaging platform that I communicated on. But what’s really interesting is, I mean, this is hard these days, you guys will probably appreciate that, but back then, if you emailed the owner of a website, they were happy to talk to you; that was not a common thing back then. So you could easily connect with people online if they ran a popular website that you were a member of. And that’s how I actually met this person; he owned the website. The website had a very popular forum. This was in the days of Hot Scripts, if you guys remember that, where you could go download little snippets of PHP or JavaScript and things like that. He had a competitor to that, which also had a forum. So I met him through that, and we just sort of, you know, conversed that way, either through the inbox of that forum, which was vBulletin, or ICQ. So a little bit of AIM in there as well, the AOL Instant Messenger, but mostly actually over email, because a lot of those things were clunky back in the day. People didn’t normally keep that stuff running 24/7, so it was very rarely real-time communication back then, at least in my experience. But I got to know him by just offering to help for free. That was very successful for me back then. It helped me learn, and it made connections. And we just grew to be virtual friends, if you will. And then we started a company. And the premise was pretty straightforward. This was back when Google released its PageRank algorithm. So we had the idea of taking his popular websites, which were already PageRank six and seven, which is out of 10. That was pretty decent back then. And we sold links, which is now not acceptable. But back then this was a relatively new thing nobody was doing; there was no precedent. It wasn’t good or bad. And we were able to get Template Monster and Dotster and some of those people to join. And the whole point was to make money with the actual links, but also to double dip and promote our own properties as part of those links. And so we ended up having 10-plus websites that had huge amounts of traffic, so probably millions of unique visits a month, back in the early 2000s, which, you know, today is kind of interesting, but back then that was massive.

 

Eric Dodds  06:20

And that was the, I was gonna say, golden era, but, you know, there were so many weird things. For SEO, it was the Wild West. I mean, you could do, you know, like the classic, you joke about it, but it’s like, you know, white text on a white background type stuff. But man, back then, a lot of that stuff worked really well.

 

Joshua Odmark  06:42

Yeah, exactly. Interesting little tidbit: you guys may have heard of Neil Patel, who’s kind of a big name in the SEO space these days. We were one of his first customers, because what happened was, we started to get so many inbound requests to help with these SEO-related things, we just farmed everything out to him. And so we ended up being one of his first and biggest customers back then. Yeah, so I don’t know him super well, but I got to know him decently well, and, you know, that was also all virtual. I ended up meeting him in person later, but at the beginning, it was all virtual.

 

Joshua Odmark  07:22

What’s fascinating, though, is this went gangbusters. It just blew up. Within the first month, we were making like $20,000 a month, which, again, is not a huge sum of money, but I was also 18 years old and in high school, you know. But that was just the first month, and it kept going. It got pretty nuts. So that was my first foray, and what’s most fascinating to me is to remember, back then, there weren’t really any ways to analyze data. So I remember, all things considered, you know, it was fun making money, having success quickly, just being in high school, and not really knowing what I’m doing, etc., etc. What was most fascinating was all the data that those sites generated. So I remember AWStats was a big thing offered through cPanel back in the day, but it couldn’t really handle huge amounts of data. So I remember spinning up big boxes, which, again, there was no cloud back then, so you’re renting the actual hardware from a provider, which for us was theplanet.com back then. So I remember just renting this huge box, because I wanted to run SQL queries against all this data, see where people were visiting from, what they were doing on the website, and things like that. So that was truly when my data journey first began, when I was understanding the power of being able to sift through these millions of visitors and tens of millions of pageviews. And trying to learn what that meant, what sections of the website were popular, because things like Google Analytics, which give you these canned dashboards, didn’t really exist either. So you’re on your own to figure that out. But that was a wild ride. We sold it less than two years later. And I just cruised for five years, because that was a nice little windfall, not enough to retire on. But you know, I finished my degree and then got back into the startup world. But you know, it’s interesting from an entrepreneur perspective, that was my first proper startup, and it went gangbusters. And it was easy, you know, everything just worked. And then doing the next startup and the startup after that, I thought it’d be easy. But of course, you know, building companies is incredibly hard. So I’ve since sold two other companies, but you know, I had my fair share of failures in that mix as well. So it was a pretty wild ride.

 

Eric Dodds  09:50

But what a cool story.

 

Joshua Odmark  09:53

Yeah, it was tons of fun, and I met just lots and lots of people. It was really fascinating, and it was just me and another guy that did it, and the two of us built it into something pretty amazing. So that was a lot of fun. And, surprisingly enough, I was not a software engineer or anything of the sort back then. I was more like a graphic designer in reality. So I was very much into the arts. I loved math and science in school and things like that, but I never really had a practical purpose for them. But as part of that startup, after about a year of doing this, it was mainly on autopilot, which was also fantastic. But our only big expense was programming. So I was like, oh, I’ll pick it up, you know, cut some expenses and learn something. And then it was like, oh, I love this. So I’ve been a software engineer ever since that day.

 

Eric Dodds  10:50

One question. And this is a little bit because I want to get to Pandio and I want to talk about ML Ops. And I know that Kostas has a lot of technical questions. But one thing that’s interesting, it struck me when you said, you were spinning up big boxes. I just loved this story. I hope for our audience… I know that for Kostas and me, it’s just bringing back a lot of really good memories, thinking about, you know, ICQ, and the AIM usernames that were kind of like a bad tattoo that you regret, that you had to stick with.

 

Eric Dodds  11:22

But you know, you talked about spinning up big boxes, because you wanted to analyze all this data with SQL, because you didn’t really have these out-of-the-box SaaS analytics providers. One thing that’s really interesting is you kind of have these phases, right? So there was, maybe, more like bare metal analytics that you were doing. Then you have this huge wave of SaaS analytics tools, still around, right? But then you have a lot of companies actually coming back full circle to writing SQL on big data sets, you know, on the warehouse, or other different data stores. And then you kind of have this in-between with tools like Looker, leveraging LookML, or dbt, which support the entire process. We’d just love your perspective on that. Because when you’re talking about analyzing data, okay, this isn’t the 90s, but you hear people use the exact same language today, you know, decades later.

 

Joshua Odmark  12:15

Yeah, it is interesting, because it’s almost cyclical in nature. Like, even when you look at the cloud, you know, a lot of people consider the cloud almost a step back to mainframes, just, you know, a lot sexier and things like that. But yeah, I mean, for me, it has been interesting, and I’m the type, too, where I hate reinventing the wheel with a passion. So I always go look for these tools, and in the early days, open source wasn’t really a thing. Open source was your buddy’s Hot Scripts, you know; there’s no licensing with that, you’re just using somebody else’s script that has not been vetted, etc., etc. But to me, the open source movement has just created all sorts of very interesting things. And we’ve talked to a lot of companies today where their entire offering, they may not talk about it publicly, but it is open source. And it’s always interesting to me to find out things that are open source, like Athena at AWS being built on Presto, and things like that. So I think what’s been fascinating is the open source movement has allowed entrepreneurs like myself to create tools like this, and to me, I absolutely love that, because then I can just use those to make my life easier, versus having to create it all myself. If we had to do that, our progress would be so much slower. And especially when you get into the specialized stuff. So, you guys are involved with ETL and things like that; a lot of people assume that that stuff is easy. And then when you get into the actual data of, like, sifting through form fills from your website or something like that, or your lead gen or something, you start to realize how crummy the data can be in almost any industry and how difficult that is to deal with. And then the sheer amounts of data. That’s what’s been getting very interesting. We talk to a lot of enterprises, and they’ve got so much data. It’s absolutely absurd. And they’re only using a small fraction of it. And they realize how ridiculous that is. But it’s just so hard, and they can’t understand the costs of analyzing all of it or using all of it, or what’s the ROI. So it’s an interesting space to see all these tools pop up that are slowly addressing all these problems. And then when you move into machine learning, the thing that was always fascinating to me about machine learning is it’s just like traditional software. Now, obviously the differences are there’s some pretty hardcore math and, you know, matrices behind it and all that, but at the end of the day, operationally, it feels very similar. It’s just more of everything. More CPU, more data, more storage, more memory, you know, more pods in Kubernetes, etc., etc. So then your problems become more painful if you don’t have the right infrastructure. So it’s been interesting, but I’ve just been so thankful for the open source movement, and I myself try to contribute back. We’re contributors to Pulsar. We’re about to contribute back to Trino and Presto, and then I’ve contributed other things in the past as well. So it’s really amazing. And I’m thankful that that movement has blown up these days.

 

Kostas Pardalis  15:32

Yeah, I think we are definitely building on the shoulders of giants when it comes to open source. You mentioned a couple of projects; can you tell us a little bit more about the product, the offering that Pandio has, and how it relates to open source projects?

 

Joshua Odmark  15:50

Yeah, sure. So one of the things that was very interesting to me is, we have a managed service offering for Apache Pulsar. So Apache Pulsar is a traditional distributed messaging system. It handles typical workloads like streaming, pub/sub, and traditional job worker queues. And then it also has a very interesting component to it, where you can actually host serverless functions inside of it. So you can do things like: you have an inbound topic, you can place a function on top of that topic, and then what it spits out on the other side, for the output topic, runs through that lightweight compute thing. So you can do things like ETL, and really anything you could imagine. Routing is very popular as well. As for how I came to Pulsar: before Pandio, I was working in the insurance space. And so we were involved with a lot of the big providers of insurance, names that everybody’s kind of familiar with. And so we were delivering machine learning to them, and building machine learning for them. So we did some very interesting projects. For example, there’s one company that has satellite imagery of the entire United States, and they wanted to measure the roofs of all homes in the United States. That was the premise of what they wanted us to achieve. And so that’s a massive project, very interesting. And how you solve that is a lot of fun to think through.
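To make the serverless-functions idea concrete, here is a minimal sketch of a Pulsar Function using the Python Functions SDK. The class, logic, and topic names are illustrative, not Pandio’s actual code; whatever `process` returns gets published to whatever output topic the function is configured with.

```python
from pulsar import Function

class CleanClick(Function):
    # A tiny "ETL" step placed on top of an input topic: the return value
    # is published to the function's configured output topic.
    def process(self, input, context):
        # Illustrative transform only: trim whitespace and lowercase the payload.
        return input.strip().lower()
```

A function like this would typically be deployed with something along the lines of `pulsar-admin functions create --py clean_click.py --classname clean_click.CleanClick --inputs clicks-raw --output clicks-clean`, though the exact flags depend on your Pulsar setup.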

 

Joshua Odmark  17:23

But at the end of the day, what is most interesting is that data logistics became very painful for us in many of those cases. So that’s just the literal movement of data. So that was hundreds of petabytes of data to deal with to do that particular project. And we tried to use everything out there. So, you know, they were in the cloud, so we used the cloud provider’s services, then that tipped over, then we shifted to some other things like Kafka, and found out that Kafka sort of doesn’t handle that stuff particularly well. So in my process of doing that, I started to see the value of a logistics piece of software. During my journey there, we actually built something custom based on Redis, because I was very good at Redis, personally, and our team had strong experience with Redis. So we were able to do something with Redis that was very fascinating. But it was very niche. So with Pandio, I wanted to find something, and that’s why I ended up exploring Pulsar. And the thing that’s interesting about Pulsar is it almost feels like Pulsar was created to address Kafka’s weaknesses. So for example, the separation of compute and storage is probably one of the most valuable aspects. The brokers in Pulsar are stateless. This makes scaling a lot easier. So you can actually properly scale those horizontally. And you can scale the compute, which is effectively the broker, independently of your storage. Storage is handled with Apache BookKeeper in Pulsar. So you can scale those independently, scale up the storage or scale up CPU, or scale them together. That’s very powerful. It’s also built more for the container-driven world. So it doesn’t rely on low-level, kernel-related stuff to achieve speed or things like that. So it’s more portable and more cloud native at the end of the day. And it solves the topic limitation. A very interesting customer use case that we have is a large media company who was hitting the limits of Kafka with its number of topics. So based on the way it’s architected, you can only have so many topics, and this depends on how you set up your cluster, but typically a few thousand; you’re not going to go above that. They wanted to create one topic per user in their system. So that was hundreds of thousands of topics. For Pulsar this is pretty easy, again, because it was designed differently. So there’s lots of things like that, some ancillary edge cases where Pulsar’s just more interesting. Additionally, it supports all the messaging types. So from one SDK, you can do streaming, pub/sub, or queuing. So that’s also very fascinating.
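As a rough illustration of the “one SDK, several messaging patterns” point, here is a small sketch with the pulsar-client Python library. The broker URL and topic names are placeholders; the pattern is selected mostly by the subscription type.

```python
import pulsar

client = pulsar.Client('pulsar://localhost:6650')

# Streaming / pub-sub: an Exclusive subscription keeps its own cursor over the stream.
stream_consumer = client.subscribe('persistent://public/default/clicks',
                                   subscription_name='analytics',
                                   consumer_type=pulsar.ConsumerType.Exclusive)

# Work queue: a Shared subscription load-balances messages across many consumers.
worker = client.subscribe('persistent://public/default/jobs',
                          subscription_name='workers',
                          consumer_type=pulsar.ConsumerType.Shared)

# The same client also produces.
producer = client.create_producer('persistent://public/default/clicks')
producer.send(b'{"user": "123", "page": "/home"}')

msg = stream_consumer.receive()
stream_consumer.acknowledge(msg)

client.close()
```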

 

Joshua Odmark  20:18

Although on the flip side, I’ve found for a lot of developers, that’s like a curveball to them. They’re like, wait a second, one piece of software to do vastly different messaging patterns? But once they get past that, you know, they see the value in being able to choose, right inside a single SDK, any of those messaging patterns.

 

Joshua Odmark  20:44

Pulsar’s been around for a long time. But, you know, the community is a fraction of the size of Kafka’s. So we got involved, and we became known for Pulsar, because there aren’t a lot of providers for it. So we run some pretty large Pulsar installations, especially in the finance world. Another interesting difference with Kafka is that Kafka is not really built for full durability; it’s built more for speed. Whereas Pulsar is built for durability. So for example, Kafka by default fsyncs to disk on a threshold basis. Pulsar does it for every single message. So it’s much more interesting to the banking world, for example, because they want zero message loss under any circumstance. But yeah, in doing Pulsar, we’ve now offered it as a managed service, because it’s gaining a lot of interest, both from people who have hit the limits of Kafka, which is almost always companies in the Fortune 1000 hitting those limitations, but also people who are setting out to develop new systems. Pulsar, because of its ability to scale higher, is a much better fit for machine learning. And that really is why we’re involved with Pulsar at the end of the day. Our focus at Pandio is to help companies achieve really any form of AI or ML. It’s quite shocking: something like 87% of executives want it, but only 15% have it. There’s a lot of reasons for that. But yeah, I mean, that’s what led us down the road of Pulsar in a nutshell.

 

Kostas Pardalis  22:27

That’s super fascinating. Josh, I have quite a few questions around Pulsar, and also how to use it today. But I’m very curious about the custom solution you talked about building on Redis, and the limitations you found in Kafka that made you go use Redis instead. Can you share a little bit more information about that, like what you managed to build on Redis? I love this technology, because it’s amazing the stuff that people have managed to build on top of Redis, and it’s always very interesting to hear about this. So it would be amazing if you could share a little bit more about it.

 

Joshua Odmark  23:05

Yeah, so I’m somewhat limited in some things I can talk about. But in general, Redis basically acted as a coordinator for us. What was very interesting to us about Redis is that it was extremely portable. So we treated it like it was meant to be treated, not as a proper data store, but more as a caching layer. But because of some of the embedded Lua and things like that you can do, you can add in some crazy powerful logic, at least around how keys are managed in Redis. But it basically acted as a coordinator, because when it comes to that particular issue, we had very small payloads. So it was basically a lot of coordination happening. So instead of passing anything to do with an image, we created basically just a metadata payload. So imagine it was just a reference to where it was in S3, as an example, or, if the image had to be split up, then there were four pieces of the image or 10 pieces of the image, and those needed to be coordinated in a way. Because what we had, basically, was like a mesh network of machine learning. So a lot of people call that today federated learning. So we used Redis, basically, as a way to coordinate a lot of federated learning. So that can be specifically around, like, rural and metro areas; you would have a model at the end of the day that was specific to, say, Chicago, or San Francisco, or Los Angeles. So we used Redis really to just coordinate that federated learning and keep track of what was done. And it worked very well, because you could just take the quote-unquote database that Redis created, which is just a single file at the end of the day, and move that around to restore where you were. So that helped us in scenarios where we were attempting to do some learning and we wanted to halt it. So maybe we processed like 12 percent of images, and then we wanted to start back up at that spot a week later. So it gave us the durability to be able to do that easily. Because Redis is dead simple, easy to install, easy to make portable by moving keys that you had created on one VM onto another VM. So it just was, at the end of the day, the easiest way to do some of that coordination. And the way we structured it was that what ended up being a topic in Kafka was basically just a namespace inside of Redis. And so, you know, we could pre-calculate how many of those we needed. So maybe that was 10,000, for example. And then we knew the payload size of what was being coordinated, because it was just, you know, absolute paths to S3 objects. And then we could calculate the memory that would be needed to do that. And then we sharded it ourselves. I can’t really go into too much detail on how we did that, but it’s basically the same way databases shard, you know, based on keys and things like that. So it wasn’t really too advanced at the end of the day; again, it was just coordination. But we needed to use Redis because it was just blazingly fast, and we needed a lot of them. So thousands of those individual Redis instances.
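Pandio’s Redis-based system is proprietary, but the general pattern Joshua describes (tiny metadata payloads pointing at S3, namespaced keys playing the role of topics, resumable work) can be sketched loosely with redis-py. Every name below is made up for illustration.

```python
import redis

r = redis.Redis(host='localhost', port=6379, db=0)

# One "namespace" per work stream, playing the role a topic would in Kafka/Pulsar.
NS = 'roofs:chicago'

def enqueue_piece(s3_path, part, total_parts):
    # Payloads stay tiny: just metadata pointing at the real image piece in S3.
    r.rpush(f'{NS}:pending', f'{s3_path}|{part}|{total_parts}')

def claim_piece():
    # Atomically move one item from pending to in-progress, so a halted run
    # can resume later from exactly where it stopped.
    return r.lmove(f'{NS}:pending', f'{NS}:inprogress', 'LEFT', 'RIGHT')

def mark_done(item):
    r.lrem(f'{NS}:inprogress', 1, item)
    r.incr(f'{NS}:done')
```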

 

Kostas Pardalis  26:46

Super interesting. While you were talking, you mentioned coordination, and I started thinking whether this is the kind of problem that you could solve with something like ZooKeeper or etcd, because they are used a lot for service coordination and stuff like that. But then you mentioned thousands of instances. So I’m not sure if something like this could be used, but yeah.

 

Joshua Odmark  27:10

Yeah, you definitely could have done ZooKeeper. I mean, you know, I suppose this is the case with most developers: you plan or architect, you understand the requirements, and then you want to fit the things you know to it, you know what I mean? Me being mainly in web-based programming languages, like PHP, Ruby, and things like that, a little bit of Python, ZooKeeper is like that jump into Java that none of us were really ready for.

 

Kostas Pardalis  27:41

Absolutely. No, I mean, in the end, finding the right solution is not always like solving an equation, right? There’s no one solution. I mean, it has to do with the team, it has to do with the circumstances, what you’re doing, and in the end that’s what’s fun with technology. I mean, there are so many different tools that can be used to solve the same problem out there. And yeah, Redis is one of them. That’s why, as I said, I’m always fascinated to hear what people manage to do with Redis. It’s amazing.

 

Kostas Pardalis  28:10

So this kind of problem that you talked about solving with Redis, is it something that you could do today with Pulsar?

 

Joshua Odmark  28:18

Yeah, Pulsar would have been a lot easier, mainly because it handles the distributed nature of it. I mean, Redis today … I haven’t used it too much recently, it was kind of a while ago that this was built, but we had to make it distributed. We didn’t really need the atomic nature of Redis. But Pulsar handles that for you. And the nature of things that need to be backed up or moved around is removed or handled by Pulsar itself, you know. So for example, if you wanted to process things again, you can easily replay messages in Pulsar. It’s got a reader interface. So if you’ve got 1,000 messages on the topic that you’ve already processed, you can just create a new subscriber, or use a reader interface to go back to, like, offset zero. So some of those things are just handled for you, and the producers and consumers, if you wanted to scale one up huge or the other up huge or both up huge, it’s just easier to do that. You don’t have to build a lot of that yourself with Pulsar. So I would have loved to build that solution with Pulsar. It’s to the point now where there’s actually a fair number of companies who are … if you remember the traditional concept of an enterprise service bus … a lot of companies are moving towards using something like Pulsar to be the fabric. There’s this term called data fabric where, you know, it’s something like having an ESB that touches everything. So it’s both messaging patterns, it’s access to the warehouse, the data lake, the data marts, you know, etc., etc. And then it gives you some pretty interesting controls having that middleware. None of these are new concepts, you know; middleware has been around forever, the ESB has been around forever. But because something like Pulsar has so many more additional capabilities, the serverless function type stuff does some interesting stuff, you can put business logic in the middle of things. And then, just traditionally, you know, Pulsar also can store things indefinitely. And so, with Kafka, they have an offloading function that they just came out with relatively recently, but it’s not seamless. With Pulsar, you can offload to HDFS or S3, or any blob storage, and then you can read back out seamlessly from the SDK; you don’t have to put it back into Pulsar. So things like that are just very interesting and make it interesting to use as a data lake for some people that are doing that. It’s just a lot of very interesting use cases.
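For reference, the reader interface he mentions looks roughly like this in the Python client; the broker address and topic are placeholders.

```python
import pulsar

client = pulsar.Client('pulsar://localhost:6650')

# A reader starts from an explicit position rather than a subscription cursor,
# so messages that were already consumed can be replayed from "offset zero".
reader = client.create_reader('persistent://public/default/clicks',
                              start_message_id=pulsar.MessageId.earliest)

while reader.has_message_available():
    msg = reader.read_next()
    print(msg.data())

client.close()
```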

 

Kostas Pardalis  31:01

Yeah, it’s pretty interesting. And you mentioned that many of the use cases that you’re dealing with at Pandio right now are around ML orchestration. Can you tell us a little bit more about this? How does something like Pulsar help with ML orchestration? And what is involved there? And I also think it’s not just Pulsar, right? It’s Pulsar together with Presto. Is this correct?

 

Joshua Odmark  31:24

Yeah. So Pandio really has three pillars at the end of the day. Accessibility is the first one. So we just use an open source data virtualization technology, which is Trino. These are all kind of optional in your journey to AI, but these are the things that most people need. So Trino is interesting, because it can connect to almost any data source, even flat file systems. And it lets you connect to maybe 5-10 data sources, 15-20, it doesn’t really matter, and then execute a single SQL statement against them. So you can join data in S3 flat files with data in Snowflake and things like that. So that’s very interesting. So that helps solve the data accessibility issue, where they’ve got data in some place and they just need to get to it. And then Pulsar acts as the foundational component, just because the movement of data becomes very difficult. And this is why Pulsar is very interesting. So I mentioned earlier that machine learning is a lot like traditional software, just more of everything. And so we focus typically on the heavy data use cases. So that might be a billion dollar media company generating click data and impression data. And what they want to do is detect fraud in click data. So they’ve got just an enormous amount of data coming in. So a clickstream of data, an impression stream of data, and they want to, one, just be able to handle that data. So Pulsar is great at that, just for ingesting data; it can scale out massively and handle a lot of data with few resources by comparison, especially compared to Kafka. Kafka is kind of the number one competitor when it comes to that.
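A hedged sketch of the kind of federated query Joshua describes, using the trino Python client. The host, catalog, schema, and table names are hypothetical and assume a Hive-over-S3 catalog and a Snowflake catalog are already configured in Trino.

```python
import trino

conn = trino.dbapi.connect(host='trino.example.com', port=8080, user='analyst')
cur = conn.cursor()

# One SQL statement spanning two catalogs: flat files in S3 (via Hive) and Snowflake.
cur.execute("""
    SELECT o.order_id, o.total, c.segment
    FROM hive.landing.orders_raw AS o
    JOIN snowflake.crm.customers AS c
      ON o.customer_id = c.customer_id
    WHERE o.order_date >= DATE '2021-01-01'
""")

for row in cur.fetchmany(10):
    print(row)
```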

 

Joshua Odmark  33:12

And then we built out our machine learning framework to do it in-stream. So we definitely focus on real time or near real time. But it doesn’t have to be; that just happens to be a space where there aren’t a lot of tools out there to help people do things like that. So a use case might be a media company that has a stream of clicks coming in, and they want to segment them as fraud or not fraud. And so they can use the Pandio service to ingest that and then apply a machine learning model against that live stream of clicks. So in real time, it can route a click to fraud or not fraud. And that helps them do various different things. Cybersecurity is another big one. So people will be streaming syslogs, access patterns, both logging into systems traditionally, like through, you know, an employee logging in or, you know, some third party logging in, or somebody accessing a file on a file system. So all these things are getting streamed into some central system. It doesn’t actually have to be central; that’s another interesting thing about distributed Pulsar, you can do this at the edge. It doesn’t have to be centralized. But that’s a whole other topic. But that’s just a very interesting use case where you may want to do traditional clustering of your data, or you may want to flag anomalies. That’s all you’re after. You just want to know if something weird is happening. Is somebody accessing a file that, you know, they haven’t accessed in two years, and they’re accessing a lot of them? An interesting use case for one of our customers was they used this to find an employee who was downloading everything off of the company’s servers. They were clearly doing a data dump and were likely going to leave the company. They were legitimately, with their user account, downloading every single file. And this was a medical company, so they were very sensitive to somebody doing something like that. And they caught them and fired them that day. So there’s lots of use cases like that. Again, it is weighted towards real time or near real time, but it works traditionally as well, for things that are less important. You know, maybe you need to run something once a day or once a month. But we certainly excel in huge amounts of data, like Disney+ streaming amounts of data, as well as things that are real time and need actions made quickly. The faster you act, the more money saved.
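A rough sketch of what in-stream scoring like that can look like with pulsar-client. The scoring function below is a stand-in, not Pandio’s model, and the topic names are invented.

```python
import json
import pulsar

client = pulsar.Client('pulsar://localhost:6650')
consumer = client.subscribe('persistent://public/default/clicks',
                            subscription_name='fraud-scoring',
                            consumer_type=pulsar.ConsumerType.Shared)
fraud_out = client.create_producer('persistent://public/default/clicks-fraud')
clean_out = client.create_producer('persistent://public/default/clicks-clean')

def score(click):
    # Stand-in for a real model: flag improbably fast repeat clicks.
    return 1.0 if click.get('ms_since_last_click', 10_000) < 50 else 0.0

while True:
    msg = consumer.receive()
    click = json.loads(msg.data())
    out = fraud_out if score(click) > 0.5 else clean_out
    out.send(msg.data())           # route the click to the fraud or clean topic
    consumer.acknowledge(msg)
```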

 

Kostas Pardalis  35:47

Makes sense. And what about Trino? You mentioned Trino as a data virtualization solution. How is this used in Pandio right now?

 

Joshua Odmark  35:55

So that was mainly a function of, you know, really providing value with the other two pieces of Pandio, which are the middle piece, which is logistics, and then the third piece, which is the actual machine learning, so actually building models, training, and doing the inference. For a lot of people, especially big enterprises, it takes them a while to make a decision and move data. So for example, they might have accepted or decided to use Snowflake as a data warehouse or something like that; they’ve chosen that as their future. And so now you have to wait, sprint by sprint, for things to get in there, to actually move data into Snowflake. So we found there’s just an opportunity where, before someone made that decision, or while they were on their road to implementing that decision, having something like Trino is very interesting, because it can just virtualize that as a stopgap. And we found, too, that even when you have a very forward-thinking enterprise or company in general, they usually never move all of their data into a warehouse or a data lake. It’s like the 80-20 rule feels like it pops up everywhere, you know. So there’s always, like, the 20% that they want to get access to, but can’t really, for various different reasons. So Trino, again, a lot of companies just like to use that because it made getting data into their pipelines a lot easier. Trino is dead simple. I mean, it’s easy to pitch, you know; it’s like, hey, do you want to run SQL against all of your data? Yeah, I’d love that. You know, it’s easy to demonstrate. And it’s, I mean, it’s not easy to run, but it’s not hard to run. So you know, we just offer that as a managed service, because it fits really well into kind of AI orchestration.

 

Kostas Pardalis  37:54

That’s super interesting. There is a lot of noise lately around feature stores. I don’t know if you have heard about them, products like Tecton, or open source solutions like Feast? What’s the relationship of feature stores to what you’re talking about doing at Pandio? Do you see these things working together? Do you think there’s an overlap there? How is this landscape around them starting to form?

 

Joshua Odmark  38:22

Yeah, so for us, we’re heavily focused on the actual training and deployment of models. So a lot of those relationships, and even the data catalog people, or even existing ML Ops platforms, there’s a lot of synergy there. For us, we look at those as things we plug into to make things easier. So for example, data catalogs can actually feed into something like Trino, if you’re using, say, a Hive catalog to index blob storage data. When it comes to feature stores and model versioning, those are very powerful things. But I consider those things cutting edge. I wish more people knew about them. I talk to some advanced enterprises, some of the biggest companies in the world, and I’m shocked that they can count the number of models they have in production on their hands, you know. So tools like that make it a lot easier. But yeah, so for me, we consider those as things we would plug into; it’s very much about the Python library we build plugging into things like that. Again, it comes back to not having to reinvent the wheel. We’re, you know, dead focused on something very specific. And then these things that can make the road easier, or make it easier to democratize, or make the operational component of it easier, we love to partner with those types of things. We don’t have any plans to build some of that stuff out.

 

Kostas Pardalis  40:04

That’s super interesting. So just one last question for me, because I have completely monopolized the conversation today, and then I’ll leave the stage to Eric. I’m very curious about something, and I have had this in my mind from the beginning of our conversation, also for personal reasons. You mentioned that one of the limitations that Kafka has is around the number of topics, and you mentioned that there are companies, especially in the Fortune 1000 group of companies, that reach this limitation. Can you share with us some use cases that are causing these kinds of limitations to be triggered? Because obviously, when Kafka was designed, they had in their mind that Kafka was going to be used in a way where nobody would need, like, thousands of topics, right? And by the way, the reason that I’m interested in this is because in my previous company, Blendo, we were using Kafka, and we had to figure out a way to deal with these limitations. So I’d love to hear from you about this.

 

Joshua Odmark  41:04

I’ll get a little pie in the sky on you guys here. So not a ton of companies, but some companies are understanding that you can use things like Kafka and Pulsar to segment your data in very powerful ways, in the same way an index in a database would allow you to segment data. And that would mean that you would have an interesting use case for creating a lot of topics. So one might be, you know, a lot of companies create these segments. Let’s take media, for example. So they’ve got segments for their customers. So they might have, you know, people living in major metro areas, or they might have high income earners. So they have these segments of their customers, but they’re limited in how they can do those segments. You know, it has to be categorical or high level. I like to use the Facebook News Feed as an example of why this is important, and I’ll tie it to some specific use cases. What’s interesting is that your feed on Facebook is very much tailored to you as an individual. So you can imagine, how would you create a machine learning model that is tailored to an individual? That’s like the Holy Grail. So instead of doing categoricals, like, if you earn between 50 grand and 75 grand you’re in this segment, imagine if I could create one specifically for you. Something like this would mean I now need to segment your data exclusively. So the things you like on the internet, the things you look at, with your consent, you know; imagine if that happened on a platform like Facebook. So there’s a major media company that we help out doing this right now. They’re involved with shopping. So if you could create a topic that was specific to an individual user, now I can do very interesting things. So I was born and raised in Michigan, so I’m a big Detroit Lions fan. So it’s pretty easy to loosely understand that I might like Detroit sports. So that’d be more like a categorical model. But it becomes very hard to track that I like an individual player. Matthew Stafford was a quarterback for the Lions; he was just traded, it was a whole big thing. So for that shopping network to track my preference for an individual person, that’s where they start to lose the minutiae of things. And for shopping, that can be very important. Because while Stafford is no longer a Detroit Lion, I’m still a fan of him. So I might still want to buy his jersey or something on the new team that he’s on. So that’s a capability today that someone was trying to achieve and couldn’t. And so what they do is they create a topic that is specific to that user. And then they train a model specifically using that user’s data. And so it ends up looking like federated learning, where they’ve got their master model that has all the categorical stuff, and then they’ve got the federated model that’s specific to that user. It ends up feeling very much like … we all have cell phones, so we all have acronyms or names we call our spouses or pets and things like that, you know, and the keyboard, as you type messages, starts to learn what you’re doing individually, the things you do yourself, you know. It’s very much like that, but for everything. So to be able to do things like that, it’s easy on … well, I shouldn’t say easy. It’s an amazing accomplishment on a phone. But segregating it is easy, because it’s just on your phone; you’re already sandboxed in that way. But when you’re a major shopping network, that’s not so easy; you need to create that segmentation.
So this ties into, to me, envisioning a future where companies will have tens of thousands of models, minimum. Now they’ve got, like, hundreds if you’re lucky; it is rare that I find a company that’s got hundreds of models in production. It’s typically like 10 or 20. You know, so I imagine a future where you want to have tens of thousands of models as an individual company. What would that look like? It’s going to be federated, it’s going to be distributed. What does that look like? And so that’s where I saw Pulsar as the future, because you can do tens of millions of topics. And that can be the baseline of the stream of data for each model.
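A loose sketch of the topic-per-user idea with pulsar-client: fan events out from a shared firehose to one topic per user, which a per-user model can then read on its own. It assumes automatic topic creation is enabled (the default) and that each message carries a `user_id` property; all names are invented.

```python
import pulsar

client = pulsar.Client('pulsar://localhost:6650')

def user_topic(user_id):
    # One topic per user; with auto topic creation enabled, producing to a
    # new topic name is enough to materialize it.
    return f'persistent://shop/events/user-{user_id}'

firehose = client.subscribe('persistent://shop/events/all-clicks', 'fanout',
                            consumer_type=pulsar.ConsumerType.Shared)
producers = {}

while True:
    msg = firehose.receive()
    user_id = msg.properties().get('user_id', 'unknown')
    if user_id not in producers:
        producers[user_id] = client.create_producer(user_topic(user_id))
    producers[user_id].send(msg.data())
    firehose.acknowledge(msg)

# A per-user model would then attach a reader or subscription to
# user_topic(user_id) and train incrementally on just that user's events.
```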

 

Joshua Odmark  45:46

Now, hosting, you know, 10 million models is its own difficult thing. But we’ve got some fascinating technology to actually do that. So that is a focus of ours too. And I was, like, working backwards from the Terminator example. Like, if Skynet were to happen, what would it actually look like, from an infrastructure stack standpoint? Or data sharing? What would that look like? You would need, to me, millions of niche models, and the ability, from a mesh network standpoint, to share the outputs of one model as the input of another model in some huge mesh network. And so that’s why, to me, something like Pulsar and the Python library I built is kind of moving in that direction, because I thought, that’s the next step. You know, it’s gonna be someone who needs to create thousands of models, not 10. What do they need to do that? You know? So that’s kind of what you’ll find at Pandio. But again, I don’t want to say if Skynet happens, blame Pandio. But that’s kind of, from a technical perspective, the thought process that kicked this off years ago.

 

Kostas Pardalis  46:57

This is great. Eric, it’s all yours.

 

Eric Dodds  47:00

Well, thank you. I have just been so fascinated by this conversation. And really, I think in the best way possible, there are so many more questions that I have, and probably Kostas as well. But we are at time, and I want to be respectful of your time. Josh, this has been really great. Loved your story, loved hearing about Pandio and all the interesting things you’re doing. So why don’t we have you back on the show? I’d love to dig into a couple of the specific things you talked about as far as use cases, etc. So we’ll have you back on the show again, and we’ll continue the conversation.

 

Joshua Odmark  47:33

Awesome. Well, thanks, guys. I really appreciate it. It’s lots of fun from my perspective. I really enjoy doing things like this. So thank you again.

 

Eric Dodds  47:42

Well, Redis is a really fascinating tool. And we’ve seen lots of companies do really interesting things with it. That might be the most interesting Redis use case that I’ve heard about. But that’s actually not my biggest takeaway. I think my biggest takeaway was taking a stroll down memory lane, and just reminiscing a little bit about ICQ and AIM, and, you know, spinning up servers to run SQL. I mean, that’s great. I just really enjoyed that. So that’s my big takeaway. I hope we can do more of that in future episodes.

 

Kostas Pardalis  48:20

Yeah. Although the bitter side of this is that it reminds me of how old I am. But it was great to hear how things were done back in the 90s, to be honest. So this was great. That was an amazing part of our conversation. It was a great conversation in general, to be honest. I mean, Josh has a lot of experience with many different things that have to do with data and building actual products on top of data. So it was amazing to hear about all these use cases, and the products he has built even before all the latest hype around building products on data. And of course, it was amazing to hear how they used Redis for that. What I’ll keep from the conversation that we had is about ML and how machine learning is actually deployed right now, and how early we are in the commercialization of, let’s say, machine learning. There are amazing things happening and a lot of work that still has to be done. And that, of course, means that there’s a lot of opportunity out there, both for building amazing technologies, but also for building amazing companies. So let’s see what happens in the next couple of years. I think it’s going to be fascinating.

 

Eric Dodds  49:31

It absolutely will. Well thank you again for joining us on The Data Stack Show. More exciting episodes like this one are coming up. So until then, we’ll catch you later.

 

Eric Dodds  49:42

We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at Eric@datastackshow.com.

 

Eric Dodds  50:01

The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.