This blog post shows how transactional machine learning (TML) integrates data streams with automated machine learning (AutoML), using Apache Kafka® as the data backbone, to create a frictionless machine learning process. This blog post also highlights the business value of combining data streams with AutoML, the types of use cases that can benefit from a TML platform, and how TML differs from conventional machine learning processes. Finally, a general TML architecture is presented using MAADS-VIPER, MAADS-HPDE, MAADS Python Library, and Apache Kafka.
Using these technologies, we will:
If you’d like some background on the relationship between Kafka and machine learning before diving in, check out How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka.
The following list illustrates the breadth of use cases that can benefit from the application of machine learning to data streams in a scalable and robust platform:
Transactional machine learning (TML) applies automatically generated machine learning algorithms to data streams with minimal manual intervention. TML creates a frictionless machine learning process, allowing organizations to build advanced and elastic TML solutions using data streams. TML was pioneered by OTICS Advanced Analytics, a company that specializes in streaming AI/ML solutions integrated with Kafka. Underpinning TML is the belief that fast data requires fast machine learning.
There are five principles of TML:
The above principles are used to identify use cases that can benefit from TML. Kafka is an ideal platform for TML solutions because it aligns with principles 1 and 2: data fluidity and stream joining. For highly non-linear data, building machine learning models, or micro-models, from the latest set of data can improve the quality of what is learned and therefore improve real-time decision-making.
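To make the micro-model idea concrete, here is a minimal sketch in plain Python. It is not how MAADS-HPDE works internally; the `MicroModel` class, its one-variable least-squares fit, and the window size are all assumptions chosen for illustration. The point it shows is that refitting on only the latest points lets the model follow a relationship that changes over time.

```python
from collections import deque


class MicroModel:
    """Hypothetical helper: refits a one-variable linear model
    on the most recent `window` observations only."""

    def __init__(self, window):
        # A bounded deque drops the oldest point when a new one arrives.
        self.points = deque(maxlen=window)
        self.slope = 0.0
        self.intercept = 0.0

    def update(self, x, y):
        """Add one observation and refit by ordinary least squares."""
        self.points.append((x, y))
        n = len(self.points)
        mean_x = sum(p[0] for p in self.points) / n
        mean_y = sum(p[1] for p in self.points) / n
        var_x = sum((p[0] - mean_x) ** 2 for p in self.points)
        if var_x == 0:
            # Not enough spread in x to fit a slope: fall back to the mean.
            self.slope, self.intercept = 0.0, mean_y
            return
        cov_xy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in self.points)
        self.slope = cov_xy / var_x
        self.intercept = mean_y - self.slope * mean_x

    def predict(self, x):
        return self.intercept + self.slope * x


# Simulated stream whose behavior changes halfway through.
model = MicroModel(window=5)
for x in range(10):
    model.update(x, 2 * x)        # early regime: y = 2x
for x in range(10, 20):
    model.update(x, 30 - x)       # later regime: y = 30 - x

print(model.slope, model.predict(20))
```

Because the window holds only the five latest points, the fitted slope reflects the later regime rather than an average over the whole history.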
Kafka and Confluent connectors are built for data collection from multiple source applications into Kafka or OTICS. OTICS connectors are designed to be deployed remotely or close to the source and are deployed in standalone mode, whereas Confluent connectors can be deployed close to the source or run close to Kafka in a cluster mode.
To start building TML solutions, you need the following:
Note: Get started with Confluent Cloud, a fully managed event streaming service based on Apache Kafka, using the promo code CL60BLOG to get an additional $60 of free usage. You can access a free trial for MAADS-VIPER, MAADS-HPDE, and the MAADS-Python Library by sending a request to email@example.com. OTICS will provide a one-hour free overview and setup session if needed.
The following discusses the differences between conventional and transactional machine learning.
Conventional machine learning (CML) processes are not designed for data streams due to these reasons:
TML differs from CML in the following ways:
Here are some key considerations in the TML process to highlight:
MAADS-VIPER and MAADS-HPDE can run as any number of instances on Windows or Linux-based systems; the number is limited only by hardware. Organizations can scale TML solutions easily and quickly with Confluent Cloud, which provides unlimited storage, security, and scalability in a distributed network using the REST API and the MAADS Python library.
Large TML solutions can be architected as microservices so that they can shed load across distributed Kafka brokers.
The MAADS Python library is a convenient way to create low-code TML solutions by simply using the functions in the library shown in Table 1 below:
||Create topics in Kafka brokers.|
||Subscribe consumers to topics. Consumers will immediately receive insights from topics. This also gives administrators of TML solutions more control over who is consuming the insights and allows them to ensure that any issues are resolved quickly in case something happens to the algorithms.|
||Users can perform TML on the data in Kafka topics. This is very powerful and useful for “transactional learnings” on the fly using HPDE. HPDE will find the optimal algorithm for the data in a few minutes.|
||Using the optimal algorithm, users can make real-time predictions from data streaming into Kafka topics.|
||Users can also run an optimization on the optimal algorithm to find the values of the independent variables that minimize or maximize the objective function.|
||Users can produce to any topics by ingesting from any data sources.|
||Users can consume from any topic and use it for visualization.|
||Users can consume from multiple topic streams at once.|
||TML administrators can create a consumer group made up of any number of consumers. You can add as many partitions for the group in the Kafka broker as needed and specify the replication factor to ensure high availability and no disruption to users who consume insights from the topics.|
||Users who are part of the consumer group can consume from the group topic.|
||Consolidate the data from the streams and produce it to the topic.|
||Users can join multiple topic streams and produce the combined results to another topic.|
||Users can create a training dataset from the topic streams for TML.|
The above functions allow anyone with minimal Python knowledge to connect to MAADS-VIPER and Kafka and perform all the necessary functions to build powerful TML solutions that can push the limits of Kafka for machine learning.
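The join and training-dataset steps described in the table can be sketched without Kafka or the MAADS library. The example below is an illustration only: `join_streams` and `to_training_rows` are hypothetical names, not functions from the MAADS Python library, and in-memory lists of `(timestamp, value)` pairs stand in for topic streams. It assumes records are joined on an exact shared timestamp.

```python
def join_streams(streams):
    """Join several streams on their shared timestamp.

    `streams` maps a stream name to a list of (timestamp, value) pairs.
    Only timestamps that appear in every stream are kept.
    """
    by_ts = {}
    for name, records in streams.items():
        for ts, value in records:
            by_ts.setdefault(ts, {})[name] = value
    return [
        {"ts": ts, **values}
        for ts, values in sorted(by_ts.items())
        if len(values) == len(streams)
    ]


def to_training_rows(joined, dependent, independents):
    """Split joined records into (features, target) pairs."""
    return [
        ([row[name] for name in independents], row[dependent])
        for row in joined
    ]


# Three simulated topic streams; the pressure stream is missing ts=2.
streams = {
    "temperature": [(1, 20.0), (2, 21.0), (3, 22.0)],
    "pressure": [(1, 1.0), (3, 1.2)],
    "failure_rate": [(1, 0.1), (2, 0.2), (3, 0.3)],
}
joined = join_streams(streams)
rows = to_training_rows(
    joined, dependent="failure_rate", independents=["temperature", "pressure"]
)
print(rows)
```

Dropping timestamps that are missing from any stream is one possible design choice; a real solution might instead fill gaps or join within a time tolerance.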
Below is a general TML architecture with Confluent Cloud:
Some points to note about the architecture:
As businesses face a post-COVID world, the need for increased operational efficiencies and broader use of AI will rise. As the volume of data and the rate at which it is created keep growing, organizations will need faster ways of learning from that data. The ability to learn faster will lead to faster decision-making.
Data streams integrated with AutoML that can address large-scale problems will continue to gain traction. The resulting business value will manifest as the following:
The increased speed of creating algorithms will lead to an increased number of algorithms in circulation in the enterprise. This creates challenges with:
As HPDE finds optimal algorithms for the training datasets, VIPER produces the information about the algorithm type, parameter values, and forecast accuracy to a Kafka topic. In addition, MAADS-VIPER automatically keeps track of metadata associated with every model created and stores the information shown in Table 2 in a local embedded database.
|Activate/Deactivate||Administrators of VIPER can manually activate or deactivate a topic, or they can tell VIPER to automatically deactivate algorithms using alerts and notifications.|
|Topic/Algorithm||The topic name in Kafka can also be the name of the algorithm and is chosen by the user.|
|Last Read of Topic||This is the date and time when the topic was last read from. Whenever a consumer reads from a topic, VIPER records the date and time.|
|Bytes Read (KB)||The amount of data, in kilobytes, read by the consumer. This is the egress amount. Egress is an important metric for pricing cloud consumption.|
|Active Read Days||This is the number of days that the topic has been read.|
|Consumer ID||Consumer ID uniquely identifies the consumer that is consuming from the topic.|
|Producer ID||Producer ID is the ID of the producer that is writing to the topic.|
|Group ID||Group ID is the ID of the group that the consumer may belong to. Administrators can create groups and assign consumers to the group. This could be useful for parallel delivery of information from a single topic to multiple consumers.|
|Company Name||This is the name of the company that the consumer belongs to.|
|Contact Name||This is the name of the consumer.|
|Contact Email||This is the email of the consumer.|
|Location||This is the location where the consumer resides.|
|Description||This is a description of the topic.|
|Last Offset||When a consumer reads from a topic, this field records the last offset or the location of the read in the topic.|
||This indicates if a topic is active (1) or not active (0).|
|CreatedOn||This is the date and time when the topic was created.|
|Updated||This is the last date and time that the topic was written to by the producer.|
|Last Write to Topic||This is the last date and time that the topic was written to by the producer.|
|Bytes Written (KB)||This is the number of bytes, in kilobytes, written by the producer.|
|Active Write Days||This is the number of days that the topic was written to.|
|Replication Factor||The replication factor is a feature of Kafka and specifies the number of brokers that hold a copy of each partition for redundancy.|
|Partitions||This is the number of partitions in the topic.|
|Dependent Variable||If this topic is a joined topic that stores a training dataset, then this is the name of the dependent variable in the training dataset.|
|Independent Variables||If this topic is a joined topic that stores a training dataset, then these are the names of the independent variables in the training dataset.|
|MAADS-HPDE Algo Server||This is the physical IP address or network name of the AutoML server.|
|MAADS-HPDE Port||This is the network port that the MAADS-HPDE server is listening on.|
|MAADS-HPDE Microservice||If you are using a reverse proxy, this is the network name for the proxy. Using microservices can be very beneficial for large-scale TML solution deployments.|
|MAADS Algorithm Key||This is the key name that uniquely identifies the optimal algorithm.|
|MAADS Token||This is a secure token needed to access MAADS-HPDE.|
|Joined Topics||This lists the names of the topics that are joined together.|
|Group ID||This is a unique ID for the group that consumers belong to.|
|Group Name||This is the name of the group.|
|Number of Consumers||This is the number of consumers in the group.|
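A few of the read-side fields in Table 2 can be modeled in a small sketch to show how they relate to each other. This is not VIPER's schema or code: the `TopicMetadata` class and its `record_read` method are hypothetical, and the sketch assumes 1 KB = 1,024 bytes.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class TopicMetadata:
    """Hypothetical record of read-side metadata for one topic."""

    topic: str
    active: int = 1                        # 1 = active, 0 = not active
    last_read: Optional[datetime] = None   # "Last Read of Topic"
    bytes_read_kb: float = 0.0             # "Bytes Read (KB)", the egress
    last_offset: int = -1                  # "Last Offset"
    read_days: set = field(default_factory=set)

    def record_read(self, when, num_bytes, offset):
        """Update the metadata after a consumer reads from the topic."""
        self.last_read = when
        self.bytes_read_kb += num_bytes / 1024  # assumes 1 KB = 1,024 bytes
        self.last_offset = offset
        self.read_days.add(when.date())

    @property
    def active_read_days(self):
        """Number of distinct calendar days with at least one read."""
        return len(self.read_days)


meta = TopicMetadata(topic="example-topic")
meta.record_read(datetime(2022, 1, 1, 9, 0), num_bytes=2048, offset=10)
meta.record_read(datetime(2022, 1, 1, 17, 0), num_bytes=1024, offset=25)
meta.record_read(datetime(2022, 1, 2, 9, 0), num_bytes=1024, offset=40)
print(meta.bytes_read_kb, meta.active_read_days, meta.last_offset)
```

Storing the set of read dates rather than a counter keeps two reads on the same day from being counted as two active days.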
You are also provided with a dashboard called the Algorithm and Insights Management System (AiMS), as shown in the figure below:
The above image shows an AiMS dashboard for Kafka consumers, and similar information is provided for producers and groups. AiMS allows administrators to set up notifications and alerts to activate or deactivate algorithms. By keeping track of egress and ingress, administrators can also keep track of how much TML solutions cost and who is producing to and consuming from them.
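As a rough illustration of how egress figures translate into cost per consumer, the sketch below totals kilobytes read by consumer ID and applies a flat price per gigabyte. It is not part of AiMS; the function name, the flat-rate pricing model, and the price used are assumptions, and real cloud pricing is usually more involved.

```python
from collections import defaultdict

KB_PER_GB = 1024 * 1024  # assumes binary units


def egress_cost_by_consumer(reads, price_per_gb):
    """Estimate egress cost per consumer.

    `reads` is an iterable of (consumer_id, kilobytes_read) pairs.
    `price_per_gb` is a hypothetical flat rate per gigabyte.
    """
    totals_kb = defaultdict(float)
    for consumer_id, kilobytes in reads:
        totals_kb[consumer_id] += kilobytes
    return {
        consumer_id: kilobytes / KB_PER_GB * price_per_gb
        for consumer_id, kilobytes in totals_kb.items()
    }


reads = [
    ("consumer-a", 1048576.0),  # 1 GB
    ("consumer-b", 524288.0),   # 0.5 GB
    ("consumer-a", 1048576.0),  # another 1 GB
]
costs = egress_cost_by_consumer(reads, price_per_gb=0.10)
for consumer_id, cost in sorted(costs.items()):
    print(consumer_id, round(cost, 4))
```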
The data in topics can be validated through the CLI or a GUI tool like Confluent Control Center.
Conventional machine learning processes are not ideal for data streams. TML with Apache Kafka offers a way forward, allowing organizations to layer deeper machine learning onto data streams to create a frictionless machine learning process for greater business impact.
MAADS-VIPER, MAADS-HPDE, and MAADS Python Library, alongside Kafka, work together to provide TML solutions that can easily scale and provide faster decision-making with deeper insights—all in real time.
As data creation speed increases, machine learning speed will also need to increase. Kafka with MAADS-VIPER and MAADS-HPDE is breaking new ground in transactional machine learning and will continue to add considerable value to organizations that are looking for faster ways to apply machine learning to data streams at scale.
Interested in learning more? Download the MAADS-VIPER Kafka Connector to get started!
If you’re looking to scale machine learning operations, reach out to OTICS for a demo on Kafka with MAADS-VIPER and HPDE by emailing firstname.lastname@example.org, or visit GitHub to get started with TML today.
Sebastian Maurice, Ph.D., is the founder and CTO of OTICS Advanced Analytics and has over 25 years of experience in AI and ML. Previously, he served as associate director at Gartner Consulting, and before that, he worked at Accenture, Hitachi Solutions, Capgemini, SAS, and Finning Digital. He has published seven international peer-reviewed journals and books. Dr. Maurice also teaches a course on data science at the University of Toronto and sits on the AI Advisory Board at McMaster University.