Friday

Kafka and AI

 

Photo by Karim Sakhibgareev on Unsplash

Overview

Apache Kafka® is a hot technology amongst application developers and architects looking to build the latest generation of real-time and web-scale applications. According the official Apache Kafka® website “Kafka is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies.”

Why Use a Queuing or Streaming Engine?

Kafka is part of general family of technologies known as queuing, messaging, or streaming engines. Other examples in this broad technology family include traditional message queue technology such RabbitMQ, IBM MQ, and Microsoft Message Queue. It can be said that Kafka is to traditional queuing technologies as NoSQL technology is to traditional relational databases. These newer technologies break through scalability and performance limitations of the traditional solutions while meeting similar needs, Apache Kafka can also be compared to proprietary solutions offered by the big cloud providers such as AWS Kinesis, Google Cloud Dataflow, and Azure Stream Analytics.

  • To smooth and increase reliability in the face of temporary spikes in workload. That is to deal gracefully with temporary incoming message rates greater than the processing app can deal with by quickly and safely storing the message until the processing system catches up and can clear the backlog.
  • To increase flexibility in your application architecture by completely decoupling applications that produce events from the applications that consume them. This is particularly important to successfully implementing a microservices architecture, the current state of the art in application architectures. By using a queuing system, applications that are producing events simply publish them to a named queue and applications that are interested in the events consume them off the queue. The publisher and then consumer don’t need to know anything about each other except for the name of the queue and the message schema. There can be one or many producers publishing the same kind of message to the queue and one or many consumers reading the message and neither side will care.
  • add an new API application that accepts customer registrations from a new partner and posts them to the queue; or
  • add a new consumer application that registers the customer in a CRM system.

Why Use Kafka?

The objectives we’ve mentioned above can be achieved with a range of technologies. So why would you use Kafka rather than one of those other technologies for your use case?

  • It’s highly scalable
  • It’s highly reliable due to built in replication, supporting true always-on operations
  • It’s Apache Foundation open source with a strong community
  • It has built-in optimizations such as compression and message batching
  • It has a strong reputation for being used by leading organizations. For example: LinkedIn (orginator), Pinterest, AirBnB, Datadog, Rabobank, Twitter, Netflix (see https://kafka.apache.org/powered-by for more)
  • It has a rich ecosystem around it including many connectors
  • A distributed log store in a Kappa architecture
  • A stream processing engine

Looking Under the Hood

Let’s take a look at how Kafka achieves all this: We’ll start with PRODUCERS. Producers are the applications that generate events and publish them to Kafka. Of course, they don’t randomly generate events — they create the events based on interactions with people, things, or systems. For example a mobile app could generate an event when someone clicks on a button, an IoT device could generate an event when a reading occurs, or an API application could generate an event when called by another application (in fact, it is likely an API application would sit between a mobile app or IoT device and Kafka). These producer applications use a Kafka producer library (similar in concept to a database driver) to send events to Kafka with libraries available for Java, C/ C++, Python, Go, and .NET.

The next component to understand is the CONSUMERS. Consumers are applications that read the event from Kafka and perform some processing on them. Like producers, they can be written in various languages using the Kafka client libraries.

The core of the system is the Kafka BROKERS. When people talk about a Kafka cluster they are typically talking about the cluster of brokers. The brokers receive events from the producer and reliably store them so they can be read by consumers.

The brokers are configured with TOPICS. Topics are a bit like tables in a database, separating different types of data. Each topic is split into PARTITIONS. When an event is received, a record is appended to the log file for the topic and partition that the event belongs to (as determined by the metadata provided by the producer). Each of the partitions that make up a topic are allocated to the brokers in the cluster. This allows each broker to share the processing of a topic. When a topic is created, it can be configured to be replicated multiple times across the cluster so that the data is still available for even if a server fails. For each partition, there is a single leader broker at any point in time that serves all reads and writes. The leader is responsible for synchronizing with the replicas. If the leader fails, Kafka will automatically transfer leader responsibility for its partitions to one of the replicas.

In some instances, guaranteed ordering of message delivery is important so that events are consumed in the same order they are produced. Kafka can support this guarantee at the topic level. To facilitate this, consumer applications are placed in consumer groups and within a CONSUMER GROUP a partition is associated with only a single consumer instance per consumer group.

The following diagram illustrates all these Kafka concepts and their relationships:

A Kafka cluster is a complex distributed system with many configuration properties and possible interactions between components in the system. Operated well, Kafka can operate at the highest levels of reliability even in relatively unreliable infrastructure environments such as the cloud.

Simple way we can use for an AI/ML application where we are using video streaming for smart camera system. AI application detect the person and object for further analysis.

Here are the simple steps on how Kafka can be used for AI/ML application where video streaming is required to detect from image:

  1. Install Kafka. Kafka is an open-source distributed streaming platform. You can install it on your local machine or on a cloud platform.
  2. Configure Kafka. You need to configure Kafka to create a producer and a consumer. The producer will be responsible for sending the video streaming data to Kafka, and the consumer will be responsible for receiving the data and processing it.
  3. Create a producer. The producer will be responsible for sending the video streaming data to Kafka. You can use the Kafka producer API to send the data.
  4. Create a consumer. The consumer will be responsible for receiving the video streaming data from Kafka and processing it. You can use the Kafka consumer API to receive the data.
  5. Process the data. The consumer will need to process the video streaming data and detect objects or events of interest. You can use a machine learning model to do this.

Here is a diagram of the steps involved:

Video Streaming Data

Producer → Kafka → Consumer → Machine Learning Model

In details a step-by-step guide on how Kafka can be used for an AI/ML application that involves video streaming and image detection:

1. Set up a Kafka cluster: Install and configure Apache Kafka on your infrastructure. A Kafka cluster typically consists of multiple brokers (servers) that handle message storage and distribution.

2. Define Kafka topics: Create Kafka topics to represent different stages of your AI/ML pipeline. For example, you might have topics like “raw_video_frames” to receive video frames and “processed_images” to store the results of image detection.

3. Video ingestion: Develop a video ingestion component that reads video streams or video files and extracts individual frames. This component should publish each frame as a message to the “raw_video_frames” topic in Kafka.

4. Implement image detection: Build an image detection module that takes each frame from the “raw_video_frames” topic, processes it using AI/ML algorithms, and generates a detection result. This component should consume messages from the “raw_video_frames” topic, perform image analysis, and produce the detected results.

5. Publish results: Once the image detection is complete, the results can be published to the “processed_images” topic in Kafka. Each message published to this topic would contain the corresponding image frame and its associated detection results.

6. Subscribers or consumers: Create subscriber applications that consume messages from the “processed_images” topic. These subscribers can be used for various purposes such as real-time visualization, storage, or triggering downstream actions based on the detection results.

7. Scalability and parallel processing: Kafka’s partitioning feature allows you to process video frames and image detection results in parallel. You can have multiple instances of the image detection module running in parallel, each consuming frames from a specific partition of the “raw_video_frames” topic. Similarly, subscribers of the “processed_images” topic can scale horizontally to handle the processing load efficiently.

8. Monitoring and management: Implement monitoring and management mechanisms to track the progress of your video processing pipeline. Kafka provides various tools and metrics to monitor the throughput, lag, and health of your Kafka cluster.

By following these steps, you can leverage Kafka’s distributed messaging system to build a scalable and efficient AI/ML application that can handle video streaming and image detection.

You can find out more on https://kafka.apache.org/ Most of the information used here were collected from Apache Kafka site and its documents.

You can find many examples in internet including this one https://www.researchgate.net/figure/Overview-of-Kafka-ML-architecture_fig3_353707721

Thank you.

No comments:

Azure Data Factory Transform and Enrich Activity with Databricks and Pyspark

In #azuredatafactory at #transform and #enrich part can be done automatically or manually written by #pyspark two examples below one data so...