Episode 138:

Paradigm Shift: Batch to Data Streaming with A.J. Hunyady of InfinyOn

May 17, 2023

This week on The Data Stack Show, Eric and Kostas chat with A.J. Hunyady, Founder and CEO of InfinyOn. During the episode, A.J. talks about his experience with Hadoop and the challenges he faced getting data out of that ecosystem. He also discusses the need for innovation in the data industry, real-time data, streaming, IoT, the benefits of Rust, and more.

Notes:

Highlights from this week’s conversation include:

  • A.J.’s background and journey in data (2:23)
  • Challenges with the Hadoop ecosystem (8:50)
  • Starting InfinyOn and the need for innovation (10:02)
  • Challenges with Kafka and Microservices (14:01)
  • Real-time data streaming for IoT devices (19:28)
  • Paradigm shift to real-time data processing (22:17)
  • Benefits of Rust (29:45)
  • Web Assembly and Platform Features (36:29)
  • Analytics and Event Correlation (40:16)
  • Real-time data processing (47:03)
  • ETL vs. ELT (52:20)
  • Final thoughts and takeaways (57:07)

 

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcription:

Eric Dodds 00:03
Welcome to The Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You’ll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at RudderStack.com. Welcome back to The Data Stack Show. Kostas, exciting episode. I actually wasn’t here to record this original episode, but I listened to your conversation with AJ from InfinyOn. Fascinating — two things stuck out to me in particular. First, Hadoop, which we’ve talked about a couple of times, but he knows quite a bit about that ecosystem. And then secondly, IoT, or the Internet of Things, which we haven’t really covered a ton on the show. So super excited to hear more about that. But yeah, what did you want to ask AJ?

Kostas Pardalis 01:05
Yeah, first of all, I mean, one of the most interesting questions I have for him is about his background. He’s not a first-time founder; he has been working with technologies like NGINX, for example — his previous company got acquired by the company behind NGINX. It’s very interesting to see someone coming from the networking world, in a way, getting involved in the data infrastructure space. So I think we have a lot to discuss about that: how did this happen, why, what’s the intuition there, and what can someone with the high-performance experience that’s needed in the networking world bring into the data space? And outside of that, we’re going to talk a lot about streaming and processing, and, as you said, the IoT use cases that are very common when it comes to streaming. So we’ll have plenty to talk about. So yep, let’s start with AJ. Let’s do it. AJ, welcome. Nice to have you here at The Data Stack Show. How are you today?

A.J. Hunyady 02:26
I’m good. Thank you very much. Thanks for inviting me. I’m excited to talk about InfinyOn.

Kostas Pardalis 02:32
Yeah, we’re also excited to talk with you. I think we have some very interesting things to talk about — about the company, about yourself, and about technology. So let’s start with you. Give us a little bit of your background and your history before you started InfinyOn.

A.J. Hunyady 02:51
Sure, happy to do that. So thinking back a few years — and it’s been quite a few — I started my technology career at a company called Computer Associates. Back in the day, we were building a spreadsheet. It was always about how do I pick the best technology that’s going to enable me to build something really cool, in this case a spreadsheet. I spent some time there; at the time it wasn’t Excel yet, it was still Lotus 1-2-3. From then on, I moved to a company called NETCOM Online, which was an Internet service provider — quite a shift from spreadsheets to an Internet service provider. At the time, Netscape was not even around; we were still using IRC, and that’s where I built my first email client, so I thought it was really cool. I got to see how the Internet could grow, and how Netscape — I should say, Netscape became a big company. From then on, my boss ended up working for a startup that built firewalls, a company called NetScreen Technologies. I joined as employee number 10 and grew with the company. We eventually went IPO in 2001 — it was a $1 billion IPO — and then we got acquired by Juniper Networks in 2004, I think for about $4 billion or so. That was my first meaningful outcome from a company I’d joined. From then on, I joined a company called Chioma Systems, which was in hardware. They were doing monitoring for various networks; you can think of it as a TAP device. That’s when I actually started to realize the value of data, and the value of a lot of data. We were building hardware where you could capture this data and do some analytics on it. I joined when there were about 50 people there, I picked up an engineering team, and I built it to around 70 people. In 2014 we went IPO, so that was my second IPO with a company I’d joined. It’s been an exciting ride. After that I said, okay, I’ve worked at these companies that were successful; let me try to join a startup. So from then on, I moved to a company called E8 Security. What we were doing at E8 was trying to use MapReduce jobs to find problems — kind of a kill chain — in a data stream: the ability to find out what type of attacks you’re facing. Now, it turned out that there wasn’t really a big data problem there, because there wasn’t a whole lot of data, but it at least introduced me to some of the problems you run into in the Hadoop ecosystem — the ability to run MapReduce, the volumes of data, and so forth. About nine months later I said, this is interesting, but I don’t think this is big data. And that’s where I found my co-founder, and we created a company called Sockets. It was at the time when container management was a big thing. There were containers — Docker had just been introduced — and we said, well, yeah, but the management plane is not great. So we started the company, went out there, fundraised, got about $1 million in seed, and said, okay, we’re going to build the product. Lo and behold, a few months later, Kubernetes jumped into that space. We had taken a bet on Docker Swarm, and as soon as Kubernetes came in we realized, oh, this could be trouble. But we built a pretty good management plane, and while we were out there talking about our technology, NGINX dropped by and said, wow, we could use a management plane like that — how about we join forces? NGINX acquired us, and we were at NGINX.
For two and a half years, my co-founder and I worked on the control plane, which then became the service mesh. And that’s where we really ran across some of the challenges that data companies have when they’re trying to deploy data at high scale. At NGINX, we noticed companies were using microservices to link a Kafka data streaming layer with microservices to get value out of data. We were at the company for two and a half years; then the company got acquired by F5. And then we said, okay, it’s time to tackle that area on our own, because it seemed to me, after a bit of analysis, that data is growing exponentially — in particular real-time data — and there are no good solutions out there to fix this problem. And here we are today. We started InfinyOn in 2019; we raised a few seed rounds — pre-seed, I should call it — and then last year we announced that we are backed by Gradient Ventures and Fly Ventures, with participation from Bessemer Ventures and TSVC. And we are looking to build next-generation data services for real time.

Kostas Pardalis 07:59
Wow, that’s quite a journey, to be honest. You started from spreadsheets, right? And you’re still in the game after more than two decades, and through that, a couple of IPOs, a couple of startups. So you clearly love doing this — that’s the first thing I can recognize: you keep going and building, I think from scratch, which is amazing. And I’d love to talk more about that. But you mentioned that at some point you were working with the Hadoop ecosystem, and that helped you identify some of the issues with the Hadoop ecosystem. Can you tell us a little bit more about that? What are these problems?

A.J. Hunyady 08:50
Some of the issues with the Hadoop ecosystem: it was really easy to get data in, but it was really hard to get data out. If you think about it, getting data into Hadoop — through Kafka or some sort of agent you were building — was easy, even for unstructured data. But to get it out, you actually had to create a MapReduce job, and that MapReduce job grew with the volume of data. So as data grew exponentially and you got more data in, the MapReduce job took longer and longer; and as you added more microservices, more business intelligence, the longer those jobs ran. So even though it was labeled as a data store for unstructured data that you can do analytics on top of, the analytics were lacking. That’s why you probably saw Hadoop fall out of favor, in favor of new technologies such as Snowflake and so forth.

Kostas Pardalis 09:51
Yeah, 100%. Do you feel like these issues that Hadoop had back then have been addressed? Or is there still space for innovation and value creation?

A.J. Hunyady 10:02
Well, if I’ve learned anything throughout my career, it’s that there’s never enough innovation; regardless of where you look, there is always room for improvement. And in particular when it comes to Hadoop, we felt that, and it has something to do with why we started InfinyOn as well. If you take a look back at the history of data, it was Hadoop first, and then Snowflake came in to fill the gap that Hadoop left in the market. What Snowflake promises is to make writing data in and getting data out easy as well. Their way to fix this problem is to give you access to data processing and to some of the analytics tooling. And if you look at the modern data stack today, data is typically written to S3 first, then you have a Snowflake connector to gather data from S3 into Snowflake, then you have dbt to run some level of transformation, which is built on a bunch of microservices, and eventually you get some sort of value from that data. What we’re seeing in that space is that a lot of it is about writing the data into your data store first, while getting the value out is still delayed — it used to be MapReduce, now it’s batch processing. So you’re improving things to some extent, but not really enough, in particular when it comes to real-time services. For instance, data is actually doubling every two and a half years, and at that scale you simply don’t have enough compute power to grow with it — compute falls behind rather fast. At the end of the day, you have to have better tooling that enables you to process that data more effectively before it hits the Snowflake database. From our perspective, this is why we believe there is a need for a paradigm shift in this market: instead of using ETL jobs to get the value out of your data, you could move some of that processing earlier in the stack, before it lands in the database. So you get value from the data for a new type of service, or you even eliminate some of the complexity you’ve created with these ETL jobs.

Kostas Pardalis 12:25
Yeah, 100%, it makes total sense. So, okay, that’s one part of the equation — your experience with Hadoop. And then you mentioned that at some point, being at NGINX, you also started seeing another part of the problem that led you to go and build InfinyOn today. Okay, NGINX is a web server, right? That’s what everyone knows about NGINX. So I’d love to understand what exactly you experienced as part of NGINX that links back to data and the processing of data.

A.J. Hunyady 12:59
Sure. So while I was at NGINX, at some point the controller team transitioned from the management plane for NGINX microservices into the service mesh. If you’re familiar with service mesh, it was Linkerd initially, and then Istio came along. Istio said, you know, we’re going to give you the ability to stitch together microservices in a containerized environment and so forth, and we’ll allow you to do that through proxies — and Istio actually picked Envoy as the proxy. We said, well, that’s great, but NGINX has been around for longer, so why don’t we add NGINX into the Istio environment? So we took NGINX and placed it in the Istio environment, and we used Kafka as the intermediation layer for monitoring. And as we introduced that into the market, we found that Kafka usage is skyrocketing and that it is really hard to use. It’s a ZooKeeper- and Java-based service that you have to build on top of, on top of a Kubernetes ecosystem — it’s really difficult to manage. That’s what drove us towards looking into ways to improve that ecosystem. As we were investigating that market, we realized that, wow, companies are not only using Kafka as an intermediation layer for monitoring; they are using Kafka to build this new class of services. We found that Kafka is in Netflix and Airbnb and Stripe, and all these companies are trying to build real-time experiences, where it’s critical to their existence that they roll out real-time services. They’re using Kafka, and then they’re taking Kafka and stitching it together with microservices, and they’re adding Flink — they’re creating this pretty large environment. They call it a modern data stack for these services. But really, in order to get any value out of that, you had to employ a lot of engineers, you had to get a lot of technology, you had to build a lot of glue logic on your own. So if you look around, you see these companies building their own stacks and making them available in the open source domain so others can utilize them. These guys have deep budgets, and they’re able to run these environments on their own. When we looked at that, we said: this is a great way to build a modern data stack, but it doesn’t seem very good, because it’s using technology that was built in the big data age — it’s still Java-based, it still requires ZooKeeper, it still does garbage collection. It still doesn’t have a Kubernetes connector unless you actually build one. And that’s why we started the company, by the way — to fix this. So here we are.

Kostas Pardalis 16:00
Before we get into the product and the technology — you mentioned Kafka and Flink and all these microservices. Can you give us one or two use cases? Why would someone stitch together all these pretty heavyweight technologies, right? Flink alone is a beast. The same about Kafka. You mentioned all the microservices and all that stuff, and we’re not even talking about the infrastructure these things run on top of, right? So it’s not just, okay, I’m going to download a CLI or run a Docker image and start working — it’s a lot, right? So what drives teams and companies to get into so much trouble? What is the use case there? And how, based on the use case, do they stitch these things together?

A.J. Hunyady 17:01
There are multiple use cases I could share — probably more than we have time for on the show — but I can focus on a few of them. For instance, personalization. That’s one of them. When you watch a Netflix show, Netflix needs the ability to see what you’re watching and the frequency you’re watching certain shows, and then make a recommendation as soon as you’re finished. It gives you personalized information based on what you’ve done. So in order to do that, they’re capturing analytics information about you — at scale, for millions of users watching shows. And they’re taking that analytics information, building aggregates over time, and getting an outcome: how many people viewed the show, to see how successful it is; what type of person is watching; what’s the geolocation; what are some of the things that are important for better user engagement; is this user going to stick to my platform, or go to alternative products such as HBO or Hulu or whatnot? Capturing that information is very complex. First you need a data collection pipeline; then you need to collect a bunch of information about everything. The frequency, the interval, varies — sometimes you collect information every second, sometimes every minute, sometimes every hour, and so forth. The data collection pipeline today is done with some sort of service that you build on your own and put on the device where the streaming agent runs. Then you need a product that does some level of aggregation, so you’re able to collect the data — and that’s where typically Kinesis or Kafka comes in: you take all the data from the edge devices and send it to Kinesis or Kafka, to be able to move the information across your network. Some of the time that information ends up in a database: you write it to S3, then to Snowflake, and then you get data out of it. But if the information is important and you have to act on it very fast, then you’ll need to deploy microservices, because you want to get information out of that data as soon as it happens. You deploy microservices because, for instance, I know that in a certain geolocation users are watching at one o’clock at night, and you wonder what’s driving that. You build a microservice to apply that level of business logic, maybe join it on a different data set — the geolocation, maybe the time, maybe the weather — and get an outcome. Then you take that information from the microservice and send it to a new stream. Typically you end up with a kind of complex environment: data stream, microservice, another data stream. And in some cases you say, okay, well, Flink does this analytics computation for me, and it does it based on a window, so let me apply Flink and then build a microservice at the tail end of that. That is very complex infrastructure. What drives it is the need to get information from those users in real time within their stack, and typically that takes a lot of stitching together. So that’s just one use case. Here’s another use case: a bunch of companies out there are trying to roll out IoT devices, and they’re doing IoT devices for industrial automation.
One of the companies we worked with is rolling out sensors for oil pipelines. What these guys actually need to identify is: what is the pressure on an oil pipeline that’s in Saudi Arabia, or any other country? Is there any leakage? Is there a pump that’s heating up? All these sensors have to be aggregated, and you have to get information about that pump, for instance, in real time. They cannot afford for the pump to just run and explode, or all of a sudden stop the flow; they have to react to it. And the faster they react, the better they can address the issue. Sometimes that translates to cost, sometimes to efficiency, sometimes to customer experience, or whatever that may be. The point is that you’re using this technology to get information very fast, process it, get an outcome, and get the notification that the pump should be addressed. And more importantly, sometimes the information from the pump is insufficient — maybe the pump is heating up because it was a very hot day — so you have to take the data, marry it with another data set, and get an outcome. So those are the types of problems people are trying to solve when they tackle a real-time use case.

Kostas Pardalis 21:40
Oh, that’s great. I think the use cases are very informative in terms of what problems we’re dealing with here. So, okay, we talked about the history and the problems. Let’s talk about the future, right? And let’s look at the company that you have started. Tell us a little bit more about that, and about the product. Let’s start with this: what is it that you’re building right now to address these use cases and the problems we’ve talked about?

A.J. Hunyady 22:17
So we believe that we are facing a paradigm shift. And the paradigm shift is moving some of the data processing, as I mentioned, from after the database to before the database, instead of using batch processing to get an outcome out of your data. If you move that into the data streaming layer, then you’re going to get to the data faster, you don’t create the operational complexity, and you don’t create the cost associated with storing lots more data than you actually need. In some cases, for example — we’ve actually done prediction for the city of Helsinki buses — you get all the GPS information coming in, and you need to make a prediction of when the bus arrives at the station. And, for example, I only need the information for a limited amount of time — the last 10, 15, 20 minutes of data — so I know when the bus is going to arrive at my station. Once the bus goes past that station, I’m not interested in that data anymore. But what if I need to provide SLAs to the bus providers? I need to find out which buses actually did arrive on time. So that has historical value. But that said, I don’t need to keep that information at a one-second or one-millisecond interval; I just need chunks of information every five minutes, every station — did they arrive at the right station at the right time? So when you process data in real time, you have the ability to do two things: process it in real time as it arrives, and create aggregates that give the operator the data to generate reports later on. We believe a lot of companies out there are switching some of these database functions over to real time to tackle the class of problems I just stated. But that said, today you can only do that if you’re deploying infrastructure that was built in the big data age — Java, difficult to deploy, difficult to operate, and so forth. Now, what InfinyOn is doing: we believe that if that paradigm shift is here to stay, and we believe it is, you’re going to need a better product that enables you to run your operations in real time. A product that not only brings the data in very fast; it has to have a small footprint, it has to scale all the way to the edge, because in most cases you’re collecting from an agent that you’d otherwise have to build on your own. It has to be able to not only capture the data, but also run the computation on it and give you an outcome. So we set ourselves up to create a product that’s relatively new in this space: it’s built with a new programming language — built in Rust from the ground up — it gives you the ability to apply programmability on top through WebAssembly, it gives you the ability to compute what we sometimes call a materialized view, and it gives you the ability to share all that information, or that intelligence, across the network. I can talk a little bit about the stack and tell you how we’re thinking about these layers, if you’d like.
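
For readers, here is a rough illustration of the downsampling AJ describes for the Helsinki buses — keeping fine-grained GPS points only briefly while rolling them up into five-minute aggregates that retain historical value. This is a minimal, self-contained Rust sketch, not InfinyOn's code; the record shape and function names are hypothetical.

```rust
// Illustrative tumbling-window aggregation: per-second bus GPS events
// are rolled up into five-minute buckets per bus, so only the compact
// aggregate needs long-term storage. Not InfinyOn's code; just the idea.

use std::collections::HashMap;

struct GpsEvent {
    bus_id: String,
    timestamp_s: u64, // seconds since epoch
    speed_kmh: f64,
}

/// (bus_id, window_start_seconds) -> (event_count, average_speed)
fn aggregate(events: &[GpsEvent], window_s: u64) -> HashMap<(String, u64), (usize, f64)> {
    let mut sums: HashMap<(String, u64), (usize, f64)> = HashMap::new();
    for e in events {
        let window_start = (e.timestamp_s / window_s) * window_s;
        let entry = sums.entry((e.bus_id.clone(), window_start)).or_insert((0, 0.0));
        entry.0 += 1;
        entry.1 += e.speed_kmh;
    }
    // Convert speed sums into averages per window.
    sums.into_iter()
        .map(|(key, (count, total))| (key, (count, total / count as f64)))
        .collect()
}

fn main() {
    let events = vec![
        GpsEvent { bus_id: "bus-42".into(), timestamp_s: 1_000, speed_kmh: 30.0 },
        GpsEvent { bus_id: "bus-42".into(), timestamp_s: 1_030, speed_kmh: 40.0 },
        GpsEvent { bus_id: "bus-42".into(), timestamp_s: 1_400, speed_kmh: 20.0 },
    ];
    // Raw per-second points can be discarded; the aggregate keeps the SLA history.
    for ((bus, window), (count, avg)) in aggregate(&events, 300) {
        println!("{bus} window starting {window}: {count} events, avg {avg:.1} km/h");
    }
}
```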

Kostas Pardalis 25:13
Yeah, let’s definitely do that, because there are some very interesting topics there. But before we get into the technology, let’s paint the picture of today — what people are doing and how they would use InfinyOn today, right? So we are talking about — and correct me if I’m wrong — a more lightweight and much smarter kind of Kafka. Is that a way to put it?

A.J. Hunyady 25:51
I’d put it as a lightweight, more performant Kafka, plus microservices, plus a level of Flink processing for a large number of use cases. As you mentioned, Flink is a beast — a very big product. Spark is a beast too, and a lot of people run into: well, I need to run Spark, or Flink, because that’s the only way I can do counting, even if it’s on a small subset. So that becomes really expensive.

Kostas Pardalis 26:23
Yeah. So today, would I adopt InfinyOn and replace parts of my stack, or complement my stack with InfinyOn?

A.J. Hunyady 26:32
Actually, it depends on the use case. We’re finding that customers come to us with different requirements. Some of them say, I will never run Java in my ecosystem again. They run Flink, they run Kafka, and they run into operational complexities and hire large teams in order to maintain it. And they say, you know what, anything but that — are there alternatives out there? One of the alternatives would be us. So in the case that you’re really looking to do something from the ground up and you don’t rely on a technology such as Kafka, you really don’t want to deploy it, and we are perfectly suitable to deploy instead. Some of the advantages there: we build data collection pipelines. For example, for the IoT vendors we’re working with, the stack is smart. We have a client and a server, obviously, just like Kafka does. The client is actually built in Rust, and it can compile to virtually any type of infrastructure — we can compile down to a Raspberry Pi Zero. We actually have users running directly on the Raspberry Pi, and we have users that built their own small chips, their own microcontrollers, and they’re running our client on them. So our client fits virtually anywhere — it’s a very small size. You compile it, you run it, and it applies some of the business logic that would otherwise be on the server; we can extend it to the edge. From then on, the client communicates directly with the InfinyOn cluster. They feed the clusters, and we do some of the things that Kafka does: you have the ability to ingest, we have the ability to do routing, replication, partitioning, all these things. In that case, why would you need Kafka? You bring in the data you want to process, we give you the ability to transform it with WebAssembly, and we send the data on its way to another product. Now, in cases where you already have Kafka — and we have those use cases as well — you have Kafka and you say, you know what, you’re a new company, you don’t have too many connectors, but Kafka gives me the ability to deploy all these connectors either on the ingress side or the egress side. We built Kafka connectors, so we have the ability to connect to existing Kafka clusters and give you the programmability: apply some of the WebAssembly functions on top, build your own custom logic, apply it on top of the stack, and send it on its way. In some cases we have Kafka as an ingest into our product; in some cases we have Kafka as a destination.

Kostas Pardalis 29:18
That’s awesome. Yeah, I think it’s very clear. Let’s get a little bit deeper into the technology. You already mentioned a couple of interesting acronyms there, and let’s start with Rust. You mentioned some of the benefits of working with a language like Rust compared to something like Java. Let’s get a little bit deeper into that. What are the benefits of Rust?

A.J. Hunyady 29:45
Well, Rust is a new language compared to Java. It’s a modern language, it’s a safe language. Rust gives you the ability to compile code that runs directly on your system, without an interpreter — you don’t need a shim layer like a JVM or some sort of sandbox to run it at all. In Python, you need an interpreter to run it. So Rust gives you the ability to run code fast and run it safely, because it runs on the machine itself, and it’s optimized for memory safety. What I mean by that is, it controls your memory. I used to be a C/C++ developer early in my career; we know the buffer overrun problems, we know the dangling pointer problems, where you have a pointer to the wrong memory location or a stack overrun, and all of a sudden bad things happen. Now, Rust gives you the ability to protect yourself from that. So we’re seeing companies out there moving from C/C++ products to Rust; Rust is seen as the next-generation C++. If you want a memory-safe language that enables you to run code fast and run it safely, Rust is probably the language for you.
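
For a concrete picture of the memory-safety point, here is a minimal Rust sketch (not from the conversation) showing the kind of dangling-reference bug the compiler rejects outright, and the safe alternative:

```rust
// A minimal sketch of the memory-safety point: the borrow checker
// rejects dangling references at compile time instead of letting
// them crash (or silently corrupt memory) at runtime.

fn main() {
    // The kind of bug a C/C++ compiler would happily accept:
    // returning a reference to a value that no longer exists.
    //
    // fn dangling() -> &String {
    //     let s = String::from("sensor reading");
    //     &s                       // ERROR: `s` is dropped at the end of
    // }                            // the function, so the reference dangles.
    //
    // Rust refuses to compile the code above ("missing lifetime specifier" /
    // "borrowed value does not live long enough").

    // The safe version: move ownership out instead of borrowing.
    fn owned() -> String {
        String::from("sensor reading")
    }

    let reading = owned();
    println!("{reading}"); // memory is freed automatically when `reading` goes out of scope
}
```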

Kostas Pardalis 31:03
What’s the benefit of using something like Rust in a product like InfinyOn? What’s the value that it brings from two sides: one from you, as the vendor who is using Rust to build the technology, but also, what’s the value that the customer gets at the end, right, given that Rust is being used under the hood?

A.J. Hunyady 31:27
So for us as a vendor, the beautiful part of Rust is that once it compiles, the code just runs. Very seldom — I mean, I don’t remember having a crash. We’ve been building this product for quite a while, we have just over 250,000 lines of code, and I really don’t remember having a crash. So you compile it — Rust gives you very good compilation capabilities — and it enables you to compile and run some really solid code. From the user’s perspective, what they’re really gaining is the performance and the small footprint — the numbers we’re seeing versus Kafka. For instance, we’re using 20% less memory than Kafka does. Our code compiles into a binary that is 20 times smaller than Kafka’s. The CPU utilization is five times more efficient than Kafka. The throughput is three to seven times better than what you get from Kafka. So these are the benefits: you get a small code base that can run anywhere natively — you don’t have to have a Java environment for it — and you get some of the performance benefits, some of the memory-safety benefits, some of the security benefits, due to the fact that we built it in Rust. So you get that added benefit of our language of choice.

Kostas Pardalis 32:53
Yeah, you mentioned something interesting a little bit earlier about the IoT use case: because you’re using Rust, you can compile to a target like a microcontroller or something like a Raspberry Pi or whatever, so you can bring, let’s say, part of the infrastructure to the edge, right? Which is obviously super, super interesting. But what was the industry doing before? Someone who’s using something like Kafka, right, and they have these IoT devices — obviously they cannot run a JVM there, it’s too much. How do they handle this?

A.J. Hunyady 33:36
Well, a lot of them are using MQTT. So if you’re using MQTT servers as a mediation device, look, it works, but you end up with different classes of problems. We have several vendors that are moving from MQTT onto our technology, and there are significant benefits to it. It’s not only the size, and it’s not only that you don’t want to run MQTT; it creates another issue, because you can’t talk back directly to the server even if you want to communicate both ways — you have an intermediation device and it’s a broadcast, which is not ideal — and setting up that network is very complicated. So when you move over to our technology — now, I haven’t gotten to the stack yet, so I’m going to throw out some terms that may not be all that familiar, because I haven’t explained what these things really are — we have the ability to run with a very small footprint, and the ability to take some of what we call SmartModules — a SmartModule that does business logic — and push it towards the edge. The beauty of that is that you can build the SmartModule, publish it to the SmartModule Hub, and all the edges can take that piece of functionality. They can update it as required. And you can run filtering with it. For instance, you can send a SmartModule saying: if my geolocation is Europe, then I’m interested in what you’re going to send me, or I should only be sending information up to the controller from devices in Europe. So whoever uses the client gets the SmartModule applied to that device, and you’re able to do multicast or broadcast. That’s one simple use case. Another would be: we have some users out there that say, you know, I would like to send some AI/ML models to the edge. There are some large companies — they have to do with electricity — that are interested in that. For instance, they want to know if they’re going to have problems with their transformers, and they need to know as soon as they’re seeing warning signs. These guys would benefit from running some machine learning at the edge. So that’s the second type of value-added service our partners can offer: you can send this intelligent logic — an AI/ML model, or a filtering function, or a transformation function — to the edge. Those are some of the benefits of having Rust, but not only Rust: the ability to send programmable logic, which is actually built in WebAssembly, to the edge.
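
To make the geolocation-filter idea concrete, here is a minimal plain-Rust sketch of the kind of per-record logic such a SmartModule might carry. The record shape and function names are hypothetical illustrations, not the actual Fluvio SmartModule API (which wraps logic like this in its own crate and macros).

```rust
// Illustrative sketch only: the kind of per-record filter a SmartModule
// pushed to an edge device might apply before forwarding data upstream.

#[derive(Debug)]
struct SensorRecord {
    device_id: String,
    geolocation: String, // e.g. "EU", "US", "APAC"
    payload: String,
}

/// Keep only records originating from the region the controller cares about.
fn keep_record(record: &SensorRecord, wanted_region: &str) -> bool {
    record.geolocation == wanted_region
}

fn main() {
    let records = vec![
        SensorRecord { device_id: "pump-1".into(), geolocation: "EU".into(), payload: "temp=71".into() },
        SensorRecord { device_id: "pump-2".into(), geolocation: "US".into(), payload: "temp=64".into() },
    ];

    // Only EU records are forwarded upstream; the rest never leave the edge.
    let forwarded: Vec<&SensorRecord> =
        records.iter().filter(|r| keep_record(r, "EU")).collect();

    println!("forwarding {} of {} records", forwarded.len(), records.len());
}
```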

Kostas Pardalis 36:06
Let’s get to that now, because that’s very interesting. So, okay, we talked about Rust; you’re also using WebAssembly, and you touched a little bit on why you do that. But tell us a little bit more about this — how it works and what features it adds to the platform.

A.J. Hunyady 36:29
Okay, so I think it’s time for us to talk a little bit about the platform and tell you about the layers. We really built a five-layer stack. The base layer is the data streaming layer. That’s all about throughput, latency, performance — the streaming has to be fast, and we built it on our own. One of our first principles was performance, because, as I mentioned, compute will not grow as fast as the data coming at you; we would like it to, but it typically cannot keep up. So we picked a better programming language, we picked a better architecture, it’s built for Kubernetes by design, and when it comes to the core functions we use declarative management. That’s where we get better performance: better latency, throughput, serialization, memory utilization, you name it. The second layer on top of that — and really layers two and three are related — is transformation, and then analytics. The transformation and analytics layers are what make your data usable. Getting data onto a data stream is not all that interesting; we’ve been doing that for a long time. But if the data is not the right shape, you write it to some sort of storage, then you pick it up and get it into the right shape, and most of the time you’re writing it back into storage — and that creates its own problems and so forth. But I digress a little bit. So let’s talk about layer two, which is the transformation. We added transformation to the product as a way to manage microservices — as a way to stitch topics to microservices without naturally ending up with what some people call microservices spaghetti in a cloud environment. Because you get all these microservices, and inside a microservice you have to specify that data comes from this topic and has to get to that topic, applying this business logic. So what we’ve done is use WebAssembly to allow you to write these microservices in a sandboxed environment, where it’s safe and doesn’t compromise the data stream, while at the same time you have the ability to code it and write it, and we will store it for you — I’ll get to that a little bit later with our intelligence store, which is actually the Hub. So the whole idea is that we give you the ability to apply custom logic on the data streaming service itself. And that service — I mean, that intelligence — can be applied in the cluster or at the edge, as I mentioned earlier in our conversation. These are very powerful constructs. Now, with these microservices you can filter, transform, and clean your data before it actually gets into a data store. For example, you can even do things like: oh, gee, I’m getting a social security number in this data stream, but I don’t want the database to see it. So you can apply a transformation, a map: remove the social security number, mask it, or encrypt it, so it gets sent to the data store in an encrypted form, because it was not supposed to be there. The fact that we’re doing these microservices with WebAssembly gives us the ability to do it in a sandboxed environment with very high performance — WebAssembly is, some people argue, the next-generation Docker. Now, I wouldn’t go that far, but the whole idea is to run this little sandboxed environment and give you security. It’s really important for security and speed.
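
As a concrete illustration of the masking transformation just described, here is a minimal plain-Rust sketch of a map-style SmartModule body. The names and record format are illustrative only, not the Fluvio SmartModule API.

```rust
// Illustrative map transformation: mask a US-style social security number
// (e.g. "123-45-6789") before the record reaches the data store.

fn mask_ssn(value: &str) -> String {
    value
        .split_whitespace()
        .map(|token| {
            // Cheap structural check: 11 chars, dashes at positions 3 and 6, digits elsewhere.
            let looks_like_ssn = token.len() == 11
                && token.chars().enumerate().all(|(i, c)| {
                    if i == 3 || i == 6 { c == '-' } else { c.is_ascii_digit() }
                });
            if looks_like_ssn {
                "***-**-****".to_string()
            } else {
                token.to_string()
            }
        })
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    let record = "jane 123-45-6789 premium";
    println!("{}", mask_ssn(record)); // -> "jane ***-**-**** premium"
}
```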
And we’re benefiting from that, because we give you the ability to build the services, but you don’t have to move the data to the service itself as you do in a Kafka environment. We actually take the intelligence to the data: you build the microservice, then you apply it to the data stream, and you don’t hard-code which data streams you’re applying it to, because you can pick it up and apply it to any of them. So you don’t really build the pipelining in code — the pipelining is separate from the code you build. Maybe one person builds the pipeline, another person builds the intelligence that should be applied on top of it, and the operator or data engineer can just inherit that functionality and apply it. So this is the second layer, transformation; the third layer is analytics. Now, the really important part: with transformation you can do packet-by-packet types of operations, but if you go into analytics, you have to do more than that. You have to look contextually at multiple packets and say, “Oh, gee, do I have an urgent event that I have to cater to?” And sometimes identifying an urgent event is not based on a single message, it’s based on a series of messages. For example, one of the banks we’re working with is trying to identify anomalies. Say one of our customers has a transaction in Paris — and I guess I’m giving away where they’re from — and a second one in London within 15 minutes; well, obviously, that’s something they need to act on. With our product, since we have the immutable store under the hood, we’re able to scan a series of packets, look at them over a specific window interval, and get an outcome. So event correlation is very important as well. And with data coming in, you also have to do enrichment. For example — and we’re actually using this for our own consumption — we are building usage-based billing for our cloud offering, and that touches on pricing. Pricing is all about the quantity of information traversing your network and also the associated price, and the price may change — it’s a property of time. We actually work with a vendor that’s building a microservice that takes that information and computes usage for other companies; they have to merge in the pricing, and that turns out to be a very complicated problem. So we have the ability to enrich the data for you. And by the way, this analytics functionality has not yet been released; it’s currently with design partners alone. But it’s a very powerful function. So that’s layer three, which we call analytics.
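
As a rough sketch of that kind of windowed event correlation — two card transactions in different cities within 15 minutes — here is a small self-contained Rust example. It is purely illustrative of the idea, not InfinyOn's implementation.

```rust
// Illustrative event correlation over a time window: flag a customer whose
// consecutive transactions happen in different cities within 15 minutes.

use std::collections::HashMap;

#[derive(Clone)]
struct Txn {
    customer: String,
    city: String,
    minute: u64, // minutes since some epoch, for simplicity
}

fn detect_anomalies(stream: &[Txn], window_minutes: u64) -> Vec<String> {
    let mut last_seen: HashMap<String, Txn> = HashMap::new();
    let mut alerts = Vec::new();

    for txn in stream {
        if let Some(prev) = last_seen.get(&txn.customer) {
            let close_in_time = txn.minute.saturating_sub(prev.minute) <= window_minutes;
            if close_in_time && prev.city != txn.city {
                alerts.push(format!(
                    "{}: {} then {} within {} minutes",
                    txn.customer, prev.city, txn.city, window_minutes
                ));
            }
        }
        last_seen.insert(txn.customer.clone(), txn.clone());
    }
    alerts
}

fn main() {
    let stream = vec![
        Txn { customer: "kostas".into(), city: "Paris".into(), minute: 100 },
        Txn { customer: "kostas".into(), city: "London".into(), minute: 110 },
    ];
    for alert in detect_anomalies(&stream, 15) {
        println!("ALERT: {alert}");
    }
}
```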
Now, layer four is what we call the access and connection layer. As we’ve learned, and as I alluded to earlier, in order to deploy this class of product you need connectors. Connectors are hard to build; there are companies out there that do that for a living, like Airbyte, Fivetran, and so forth. We found that there are lots of connectors out there, and some of them only cover a portion of the functionality, for a good reason — for example, a Salesforce connector has hundreds of APIs; how can you build a generic connector for that? So we said, you know what, we are going to offer a set of certified connectors — HTTP and MQTT, a bunch of others, Kafka, some databases like Postgres and MongoDB, and some analytics tools. But for anything beyond that, how about we focus on giving developers an environment that makes connector development easy? So instead of building all the connectors out there — because we felt it’s an NP-complete problem, we can never really keep up with it, it’s not really a solvable problem — we give you the ability to create connectors on your own. We give people something called the CDK, the Connector Development Kit, and the SMDK, the SmartModule Development Kit — a framework you work with, and then you roll out these connectors on your own. And that brings me to the last part of our stack, which is the sharing part: we built something called the Hub. It’s not a data hub in the sense that lots of data products out there are — it’s not a data warehouse. We built it for usability, for sharing the intelligence: for publishing connectors to the Hub and publishing SmartModules to the Hub. So now you have the ability to take your microservices and publish them in the Hub, instead of having them distributed around a network where nobody wants to maintain them, nobody knows what they are, and you build the same function five times over. With us, you’re able to take this microservice that you build — a SmartModule or a connector — publish it to the Hub, and then you can use it yourself, your company can use it, or the entire world can use it, depending on the publishing option you choose. So overall, this is our stack. It’s a five-layer stack: data streaming, transformation, analytics, the ability to create your own connectors, and the ability to share them.

Kostas Pardalis 44:43
That’s awesome, and that’s a lot of functionality there. I think people should go to your website and start reading the documentation and going through all the different things that can be done with the platform. One question, because I’m very curious to understand the mechanics of this feature: you mentioned how WebAssembly is used to transfer business logic to the client, right? What does this mean? Do I have to recompile the clients? What is the user experience?

A.J. Hunyady 45:28
So I guess I jumped too quickly into the idea that we give you something called the SmartModule Development Kit. Right now we only compile Rust. For convenience, we’re actually thinking of bringing additional languages in, and we have that in prototyping — we want to bring in Python as part of the SmartModule Development Kit, because we found that engineers don’t necessarily want to learn Rust in order to do that. Actually, interestingly enough, some are interested in this new language. But the whole point here is that the SmartModule Development Kit gives you the ability, just like with npm, to create a template, create a workspace, apply your own business logic, test it locally, and then publish it to the SmartModule Hub. Once it’s published to the SmartModule Hub, we actually create our own packaging. The SmartModule Hub is a hub of intelligence that creates the packaging and tells you how you can use it. Because, as you can imagine, when you create this intelligence, this SmartModule, you can add parameters. For instance, if I want to do filtering, I don’t want to write all the possible filters in the world as separate programs, because I’d end up with 50, 100,000 programs. I want to say “filter” with a parameter, and the parameter tells it: okay, filter on this entity within that data set. So when you build these programs, you have parameters, and we publish them to the Hub. Now, do you have to compile again when you pull it down? No, absolutely not. We have the runtime — the whole WebAssembly runtime — in the client itself, so you can pull it down and run it there, or we have the runtime in the cluster itself. We have something called SPUs, Streaming Processing Units, and that’s where the intelligence lives. And we give you the ability to apply the SmartModule in the runtime. By the way, there are lots of benefits to that — I might digress a little bit, but I think it’s useful. When you apply your intelligence in the cluster itself, you’re really reducing the amount of traffic you have to bring down to the client; or the other way around, when you apply the intelligence at the edge, you reduce the amount of data you have to take to the cluster for processing. So when you apply this technology, you reduce the data workload. And what do you do on the client when, for example, a new version of the SmartModule is available? We actually do versioning for you. You have the ability to see, oh, there is a new version; I just need to run a small piece of logic that says, when a new version arrives, please load it for me. You don’t have to compile anything — just bring the new version in and you’re ready to go.
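
To illustrate the parameter idea — one generic filter configured at apply time rather than thousands of hard-coded variants — here is a small plain-Rust sketch. The record format and parameter plumbing are hypothetical; the real SMDK passes parameters through its own interfaces.

```rust
// Illustrative parameterized filter: one program, configured with
// parameters when it is attached to a stream.

use std::collections::HashMap;

/// Parameters supplied when the filter is attached to a data stream,
/// e.g. { "field": "region", "equals": "EU" }.
type Params = HashMap<String, String>;

/// Record modeled as simple "key=value,key=value" pairs for the sketch.
fn parse_record(raw: &str) -> HashMap<String, String> {
    raw.split(',')
        .filter_map(|kv| kv.split_once('='))
        .map(|(k, v)| (k.trim().to_string(), v.trim().to_string()))
        .collect()
}

/// Generic filter: keep the record only if `field` equals `equals`.
fn keep(raw: &str, params: &Params) -> bool {
    let record = parse_record(raw);
    match (params.get("field"), params.get("equals")) {
        (Some(field), Some(expected)) => record.get(field) == Some(expected),
        _ => true, // no parameters: pass everything through
    }
}

fn main() {
    let params: Params = [
        ("field".to_string(), "region".to_string()),
        ("equals".to_string(), "EU".to_string()),
    ]
    .into_iter()
    .collect();

    println!("{}", keep("device=pump-1,region=EU,temp=71", &params)); // true
    println!("{}", keep("device=pump-2,region=US,temp=64", &params)); // false
}
```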

Kostas Pardalis 48:03
Yeah, that’s amazing. And we’re close to the end of the episode now, but I think we should spend more time talking about these technologies, because I think WebAssembly is one of these technologies that’s still early. I don’t think the tooling is great for someone to start working with WebAssembly right now; it’s probably scary. But there is a lot of potential, and we see some of this potential with what you’re doing. And I think it would be great, in a couple of weeks or months, to get you on another episode and talk more about that stuff — what you’ve learned and what you have discovered by doing this.

A.J. Hunyady 48:51
We’re not using WebAssembly because of the technology per se. We are using WebAssembly because we want to solve a problem, and the problem was: how can I run code safely on top of data streaming, or at the edge, without that code impacting the quality of the stream? Because if it’s insecure, if there are noisy neighbors, all sorts of bad things can happen. So we found that WebAssembly is the right form for us — the right packaging for us to be able to run that code safely on top of the product.

Kostas Pardalis 49:23
Yeah, 100%, that’s all good. Outside of the technology itself and what it does for you as the vendor, it also has a huge impact on the experience that the customer has, right? And I think we’ve just started scratching the surface of the potential here. To be honest, there are not many people out there doing serious stuff with it in production right now, so it’s amazing to have you here to talk about these things and about actually bringing WebAssembly into production — not just talking about what this technology can potentially do; here we have the potential and we have actual impact. And I think we should spend more time on that in the future. So, one more question. Listening to you all this time, I can’t stop thinking of a debate that started a couple of years ago about ELT versus ETL, right? We’ve had, since forever, the concept of ETL that says, okay, we extract the data — the data comes from somewhere — we first transform the data, do some stuff on it, and then we load it into the warehouse and start doing analytics there. Then ELT came, and they’re like, no, you don’t need to apply any kind of processing or business logic; the data should land unchanged into the data warehouse, and the data warehouse goes and does all the work, right?

A.J. Hunyady 51:07
I’m wondering who says that? I would argue that maybe it’s the Snowflakes of the world, or the data stack builders — you know, if you sell a data store, of course you’d say that.

Kostas Pardalis 51:20
But what I hear from you is, okay, there are limitations to that, obviously. And I think a very strong example of why these limitations exist is these IoT cases, right, where you are constrained and you have to bring processing in before the data even leaves the source — not even when the data is on the wire. So it seems like we’re going, in a way, from ELT back to ETL maybe, and back again, right? What’s your take on that? Where do you think the solution lies in the end? Is it somewhere in the middle? Is it one or the other, or is it just marketing? Does it even matter — in the end, isn’t it about the problem of the customer?

A.J. Hunyady 52:17
I’d love to hear your opinion as well, but I think the answer is you will need both. I don’t think the answer is one or the other; it all depends on your use case. There are real-time use cases where it’s a matter of life and death. There are places where you want to capture information about autonomous vehicles — for example, a John Deere has 5,000 sensors to tell if something goes bad, and if there’s a John Deere running toward a house, right, you want to know right away; ideally, you should find out before it actually runs into the house. So in cases like this, you need to do intelligent processing, and it’s not only one sensor — you typically get information from multiple sensors and apply a lot of business logic. In a geospatial case, for instance, you need the sensors on the John Deere, the geolocation, how close the house is — many pieces of information you need to join together in order to get information out of it. Now, in these kinds of situations — in emergency services — I would argue that you will need to do processing before the data hits your database. Because if it gets buried in a database and it takes you two, three, five minutes or half an hour to get that data back and make a decision on it, chances are you’re too late. Meanwhile, there are a bunch of other use cases where you do want the data warehouse approach. For instance, I would argue that you don’t want to run machine learning on top of a data streaming service; it’s not built for that. It’s built for processing data fast, even though it stores data in an immutable store. You want to run machine learning in a data store that has lots of data and lots of compute power — a Spark, maybe a Snowflake, which are introducing new services now. For those things, you continue doing that. Now, looking at something like RudderStack: you still need to run ETL jobs because the data is stored in different data stores. You’re still interested in the customer experience, and the customer experience spans all of time — it doesn’t necessarily have to be urgent. It’s good to have some services that are urgent, because you catch the customer as they’re interacting with you — as they use the iPhone: oh, I actually know this person is on the iPhone now, they just opened my app, maybe I send them an ad, maybe I send them something that keeps them engaged longer, keeps them in the store. It depends on who you are. But aside from that, you have all these enrichment services: you have to get information from maybe Salesforce because you just had a sale, maybe HubSpot because you sent a marketing campaign, a website enrichment because someone just added something to their shopping cart. So this customer 360 is not really well suited for real time yet — but engaging with a customer in real time is important too, and that’s actually a different use case altogether.

Kostas Pardalis 54:57
Yeah, 100%. All right, we’re here at the end. I feel like we could keep going for at least a couple more hours, which means we should call you back on the show in the future. Before we go: where can someone go and learn more about InfinyOn?

A.J. Hunyady 55:20
Well, I’d like to finish with a conclusion and then I’ll tell you how to get our product. We do have two products: InfinyOn, the commercial version, and Fluvio, which is the open source version. But before I get to that, in conclusion of what I said today: I think, in essence, we are seeing a paradigm shift coming, and there is really no good way to solve these classes of real-time problems without a new platform. And that’s what we’ve done at InfinyOn: we’ve tried to make real-time data a first-class citizen, where you start processing data before it gets into the database. If you’re writing to the database for this class of services, we see cracks in the dam. We’re seeing that the existing, Java-based technologies — I mean, they’re addressing the problem, but they’re addressing it poorly, and you’re going to pay the price in the long run. So I believe that if you want to roll out modern infrastructure and services, or deliver products in a modern way, you should take a look at InfinyOn. infinyon.com is the website where you can find us; that’s where the commercial version is. And we have InfinyOn Cloud, which uses Fluvio, our open source version, under the hood, so it gives you the ability to run the clusters in the cloud itself. We actually rolled out a white-glove offering now — the ability for you to have a solutions architect take you step by step through rolling out a service — and we are offering $150 worth of credits for that. So if you want to try it in the cloud and have that experience, we believe you will get a better experience through us, and that’s why we offer you the credits. That’s another way to help you get up and running with our product.

Kostas Pardalis 57:00
Awesome. Thank you so much, AJ, and we hope to have you back really soon. Thank you, Kostas, I appreciate your time.

Eric Dodds 57:06
As always, a fascinating conversation with AJ from InfinyOn. We covered a lot of subjects: Hadoop, IoT, real-time streaming, and WebAssembly. Kostas, I’m actually interested in what interested you most in terms of the conversation about Rust and WebAssembly. I mean, Rust is obviously very popular and gaining in popularity, but did anything stick out to you?

Kostas Pardalis 57:38
Yeah, especially in comparison with Kafka and, let’s say, this previous generation of data infrastructure systems — and systems programming in general, which was very biased towards technologies like the JVM. We see that today, because of technologies like Rust, systems programming has become much more accessible to more people. And overall, the Rust ecosystem has created many new opportunities. I think InfinyOn — both the company and the product — is an example of many similar things that we are going to see in the future: companies embracing these new languages and frameworks, and in a way rebuilding and adapting technologies and paradigms from the past, but making them much more efficient and much more compatible with the needs and the scale that we have today and tomorrow. For Rust, I think it’s interesting not just because of the performance that most people talk about. The interesting part, when we compare it with the JVM ecosystem, is this: the JVM became popular because people could go and write code without having to deal with all the issues and problems that usually arise from having to manage memory, which is a pretty hard problem. Now, Rust is a new type of language that gives you the low-level access and performance that you get with a language like C, where you have to go and manage memory, while at the same time the compiler is smart enough to guide you to manage the memory correctly. So when the program actually compiles, it’s going to be safe — it’s going to operate like it should, without security issues or crashing and all that stuff, avoiding many bugs — and that becomes available to many more developers compared to the past. That’s one of the things that makes a language like Rust a great enabler. The other thing is that Rust has an amazing ecosystem, and that’s where WebAssembly comes into the picture, which makes it very interesting. And InfinyOn knows what they are building, because, as AJ said, this combination of Rust and WebAssembly gives them the opportunity to build a system that’s extremely extensible. Someone can, for example, write a plugin or a function for the InfinyOn platform, compile it into WebAssembly, and it will run inside the InfinyOn platform — super performant and super safe and secure at the same time, because of the guarantees that WebAssembly has. So these two things — the new way of writing systems programming, and the interoperability that WebAssembly provides — are, I think, going to revolutionize a lot of what is happening on the server, even though WebAssembly was primarily built for the client. And what InfinyOn is doing today is just the beginning; I think it’s something we will see more and more of in the future, for sure. Anyone who is interested in what’s next should definitely check out what InfinyOn does and what AJ has to say about the things they are building.

Eric Dodds 1:01:22
Absolutely — definitely a must-listen for anyone interested in those topics. So yeah, definitely subscribe if you haven’t, tell a friend — we always love new listeners — and we will catch you on the next one. We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We’d also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That’s E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at RudderStack.com.