
Wednesday

Learning Apache Parquet

Apache Parquet is a columnar storage format commonly used in cloud-based data processing and analytics. It allows for efficient data compression and encoding, making it suitable for big data applications. Here's an overview of Parquet and its benefits, along with an example of its usage in a cloud environment:

What is Parquet?

Parquet is an open-source, columnar storage format developed by Twitter and Cloudera. It's designed for efficient data storage and retrieval in big data analytics.

Benefits

Columnar Storage: Stores data in columns instead of rows, reducing I/O and improving query performance.

Compression: Supports various compression algorithms, minimizing storage space.

Encoding: Uses efficient encoding schemes, further reducing storage needs.

Query Efficiency: Optimized for fast query execution.
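
These benefits are easy to see locally before touching any cloud service. The following is a minimal sketch using the pyarrow library (one of several Parquet libraries; choosing it here is an assumption, not a requirement) to write a small table with Snappy compression and then read back only the columns a query needs:

Python

import pyarrow as pa
import pyarrow.parquet as pq

# Build a small in-memory table (three columns, illustrative data)
table = pa.table({
    "user_id": [1, 2, 3],
    "country": ["DE", "US", "IN"],
    "amount": [10.5, 3.2, 7.8],
})

# Write it as Parquet with Snappy compression (a common default)
pq.write_table(table, "example.parquet", compression="snappy")

# Read back only the columns a query needs -- because the layout is
# columnar, the other columns are never read from disk
subset = pq.read_table("example.parquet", columns=["user_id", "amount"])
print(subset)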

Cloud Example: Using Parquet in AWS


Here's a simplified example using AWS Glue, S3 and Athena:

Step 1: Data Preparation

Create an AWS Glue crawler to identify your data schema.

Use AWS Glue ETL (Extract, Transform, Load) jobs to convert your data into Parquet format.

Store the Parquet files in Amazon S3.

Step 2: Querying with Amazon Athena

Create an Amazon Athena table pointing to your Parquet data in S3.

Execute SQL queries on the Parquet data using Athena.
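
Step 2 can also be driven programmatically. Here is a minimal sketch using boto3 (assuming AWS credentials are already configured; the region, database name, table name and S3 result location are placeholders) to submit the same kind of SQL query to Athena:

Python

import boto3

# Assumes AWS credentials are configured and the table exists in the Glue Data Catalog
athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="SELECT * FROM your_parquet_table WHERE column_name = 'specific_value'",
    QueryExecutionContext={"Database": "your_database"},
    ResultConfiguration={"OutputLocation": "s3://your-bucket/athena-results/"},
)
print("Query execution id:", response["QueryExecutionId"])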


Sample AWS Glue ETL Script in Python

Python


import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Initialize the Glue context and Spark session
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Load data from the source table registered in the Glue Data Catalog (e.g., JSON)
datasource0 = glue_context.create_dynamic_frame.from_catalog(
    database="your_database",
    table_name="your_table")

# Convert to Parquet and write to S3
# (an Athena table such as your_parquet_table can then be defined over this path)
glue_context.write_dynamic_frame.from_options(
    frame=datasource0,
    connection_type="s3",
    connection_options={"path": "s3://your-bucket/parquet-data/"},
    format="parquet")


Sample Athena Query

SQL

SELECT *
FROM your_parquet_table
WHERE column_name = 'specific_value';

This example illustrates how Parquet enhances data efficiency and query performance in cloud analytics. 


Here's an example illustrating the benefits of converting CSV data in S3 to Parquet format.


Initial Setup: CSV Data in S3

Assume you have a CSV file (data.csv) stored in an S3 bucket (s3://my-bucket/data/).


CSV File Structure


|  Column A  |  Column B  |  Column C  |
|------------|------------|------------|
|  Value 1   |  Value 2   |  Value 3   |
|  ...       |  ...       |  ...       |


Challenges with CSV

Slow Query Performance: Scanning entire rows for column-specific data.

High Storage Costs: Uncompressed data occupies more storage space.

Inefficient Data Retrieval: Reading unnecessary columns slows queries.


Converting CSV to Parquet

Use AWS Glue to convert the CSV data to Parquet.


AWS Glue ETL Script (Python)

Python


import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Initialize the Glue context and Spark session
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Load CSV data from S3 via the table registered in the Glue Data Catalog
datasource0 = glue_context.create_dynamic_frame.from_catalog(
    database="your_database",
    table_name="your_csv_table")

# Convert to Parquet and write to S3, partitioned by "Column A" for efficient queries
# (an Athena table such as your_parquet_table can then be defined over this path)
glue_context.write_dynamic_frame.from_options(
    frame=datasource0,
    connection_type="s3",
    connection_options={
        "path": "s3://my-bucket/parquet-data/",
        "partitionKeys": ["Column A"]},
    format="parquet")


Parquet Benefits

Faster Query Performance: Columnar storage enables efficient column-specific queries.

Reduced Storage Costs: Compressed Parquet data occupies less storage space.

Efficient Data Retrieval: Only relevant columns are read.
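
To make the storage benefit concrete, here is an illustrative local sketch (assuming pandas plus a Parquet engine such as pyarrow is installed; the data and file names are made up, so the exact sizes will vary) that writes the same DataFrame as CSV and as Parquet and compares the on-disk footprints:

Python

import os
import pandas as pd

# Hypothetical data set, just to compare on-disk footprints
df = pd.DataFrame({
    "column_a": ["Value 1"] * 100_000,
    "column_b": ["Value 2"] * 100_000,
    "column_c": range(100_000),
})

df.to_csv("data.csv", index=False)
df.to_parquet("data.parquet")  # requires pyarrow or fastparquet

print("CSV size:    ", os.path.getsize("data.csv"), "bytes")
print("Parquet size:", os.path.getsize("data.parquet"), "bytes")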


Querying Parquet Data with Amazon Athena

SQL


SELECT "Column A", "Column C"

FROM your_parquet_table

WHERE "Column A" = 'specific_value';


Perspectives Where Parquet Excels

Data Analytics: Faster queries enable real-time insights.

Data Science: Efficient data retrieval accelerates machine learning workflows.

Data Engineering: Reduced storage costs and optimized data processing.

Business Intelligence: Quick data exploration and visualization.


Comparison: CSV vs. Parquet

| Metric         | CSV         | Parquet         |
|----------------|-------------|-----------------|
| Storage Size   | 100 MB      | 20 MB           |
| Query Time     | 10 seconds  | 2 seconds       |
| Data Retrieval | Entire row  | Column-specific |


Here are some reference links to learn and practice Parquet, AWS Glue, Amazon Athena and related technologies:

Official Documentation

Apache Parquet: https://parquet.apache.org/

AWS Glue: https://aws.amazon.com/glue/

Amazon Athena: https://aws.amazon.com/athena/

AWS Lake Formation: https://aws.amazon.com/lake-formation/


Tutorials and Guides

AWS Glue Tutorial: https://docs.aws.amazon.com/glue/latest/dg/setting-up.html

Amazon Athena Tutorial: https://docs.aws.amazon.com/athena/latest/ug/getting-started.html

Parquet File Format Tutorial (DataCamp): https://campus.datacamp.com/courses/cleaning-data-with-pyspark/dataframe-details?ex=7#:~:text=Parquet%20is%20a%20compressed%20columnar,without%20processing%20the%20entire%20file.

Big Data Analytics with AWS Glue and Athena (edX): https://www.edx.org/learn/data-analysis/amazon-web-services-getting-started-with-data-analytics-on-aws


Practice Platforms

AWS Free Tier: Explore AWS services, including Glue and Athena.

AWS Sandbox: Request temporary access for hands-on practice.

DataCamp: Interactive courses and tutorials.

Kaggle: Practice data science and analytics with public datasets.

Communities and Forums

AWS Community Forum: Discuss Glue, Athena and Lake Formation.

Apache Parquet Mailing List: Engage with Parquet developers.

Reddit (r/AWS, r/BigData): Join conversations on AWS, big data and analytics.

Stack Overflow: Ask and answer Parquet, Glue and Athena questions.

Books

"Big Data Analytics with AWS Glue and Athena" by Packt Publishing

"Learning Apache Parquet" by Packt Publishing

"AWS Lake Formation: Data Warehousing and Analytics" by Apress

Courses

AWS Certified Data Analytics - Specialty: Validate skills.

Data Engineering on AWS: Learn data engineering best practices.

Big Data on AWS: Explore big data architectures.

Parquet and Columnar Storage (Coursera): Dive into Parquet fundamentals.

Blogs

AWS Big Data Blog: Stay updated on AWS analytics.

Apache Parquet Blog: Follow Parquet development.

Data Engineering Blog (Medium): Explore data engineering insights.

Enhance your skills through hands-on practice, tutorials and real-world projects.


To fully leverage Parquet, AWS Glue and Amazon Athena, a cloud account is beneficial but not strictly necessary for initial learning.

Cloud Account Benefits

Hands-on experience: Explore AWS services and Parquet in a real cloud environment.
Scalability: Test large-scale data processing and analytics.
Integration: Experiment with AWS services integration (e.g., S3, Lambda).
Cost-effective: Utilize free tiers and temporary promotions.

Cloud Account Options
AWS Free Tier: 12-month free access to AWS services, including Glue and Athena.
AWS Educate: Free access for students and educators.
Google Cloud Free Tier: Explore Google Cloud's free offerings.
Azure Free Account: Utilize Microsoft Azure's free services.

Learning Without a Cloud Account

Local simulations: Use Localstack, MinIO and Docker for mock AWS environments (a sketch follows this list).
Tutorials and documentation: Study AWS and Parquet documentation.
Online courses: Engage with video courses, blogs and forums.
Parquet libraries: Experiment with Parquet libraries in your preferred programming language.
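
As a sketch of the local-simulation option, the snippet below points boto3 at a Localstack endpoint instead of real AWS. The endpoint URL, dummy credentials, bucket name and file path are placeholders for a default Localstack setup:

Python

import boto3

# Point boto3 at a local Localstack endpoint (default edge port 4566) instead of real AWS
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:4566",
    aws_access_key_id="test",
    aws_secret_access_key="test",
    region_name="us-east-1",
)

s3.create_bucket(Bucket="my-bucket")
# Upload a local Parquet file (placeholder path) as if it were going to S3
s3.upload_file("example.parquet", "my-bucket", "data/example.parquet")
print(s3.list_objects_v2(Bucket="my-bucket")["Contents"])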

Initial Learning Steps (No Cloud Account)

Install Parquet libraries (e.g., Python's pyarrow or fastparquet packages).
Explore Parquet file creation, compression and encoding.
Study AWS Glue and Athena documentation.
Engage with online communities (e.g., Reddit, Stack Overflow).

Transitioning to Cloud

Create a cloud account (e.g., AWS Free Tier).
Deploy Parquet applications to AWS.
Integrate with AWS services (e.g., S3, Lambda).
Scale and optimize applications.

Recommended Learning Path

Theoretical foundation: Understand Parquet, Glue and Athena concepts.
Local practice: Experiment with Parquet libraries and simulations.
Cloud deployment: Transition to cloud environments.
Real-world projects: Apply skills to practical projects.

Resources

AWS Documentation: Comprehensive guides and tutorials.
Parquet GitHub: Explore Parquet code and issues.
Localstack Documentation: Configure local AWS simulations.
Online Courses: Platforms like DataCamp, Coursera and edX.

By following this structured approach, you'll gain expertise in Parquet, AWS Glue and Amazon Athena, both theoretically and practically.

What is PySpark?

PySpark is a Python API for Apache Spark, a unified analytics engine for large-scale data processing. PySpark provides a high-level Python interface to Spark, making it easy to develop and run Spark applications in Python.

PySpark can be used to process a wide variety of data, including structured data (e.g., tables, databases), semi-structured data (e.g., JSON, XML), and unstructured data (e.g., text, images). PySpark can also be used to develop and run machine learning applications.

Here are some examples of where PySpark can be used:

  • Data processing: PySpark can be used to process large datasets, such as log files, sensor data, and customer data. For example, a company could use PySpark to process its customer data to identify patterns and trends.
  • Machine learning: PySpark can be used to develop and run machine learning applications, such as classification, regression, and clustering. For example, a company could use PySpark to develop a machine learning model to predict customer churn (a sketch follows this list).
  • Real-time data processing: PySpark can be used to process real-time data streams, such as data from social media, financial markets, and sensors. For example, a company could use PySpark to process a stream of social media data to identify trending topics.
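
To illustrate the machine-learning use case above, here is a minimal sketch using Spark ML's LogisticRegression on a tiny in-memory DataFrame (the columns and values are invented purely for illustration):

Python

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()

# Tiny, made-up training set: (monthly_spend, support_calls, churned)
data = spark.createDataFrame(
    [(50.0, 1, 0.0), (10.0, 5, 1.0), (80.0, 0, 0.0), (15.0, 4, 1.0)],
    ["monthly_spend", "support_calls", "churned"])

# Spark ML expects the input columns assembled into a single feature vector
features = VectorAssembler(
    inputCols=["monthly_spend", "support_calls"],
    outputCol="features").transform(data)

model = LogisticRegression(featuresCol="features", labelCol="churned").fit(features)
model.transform(features).select("churned", "prediction").show()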

Here is a simple example of a PySpark application:

Python
import pyspark

# Create a SparkSession
spark = pyspark.sql.SparkSession.builder.getOrCreate()

# Load a dataset into a DataFrame
df = spark.read.csv("data.csv", header=True)

# Print the DataFrame
df.show()

This code will create a SparkSession and load a CSV dataset into a DataFrame. The DataFrame will be printed to the console.

PySpark is a powerful tool for processing and analyzing large datasets. It is easy to learn and use, and it can be used to develop a wide variety of applications.

You can find more details in the official PySpark documentation: https://spark.apache.org/docs/latest/api/python/index.html

Friday

Apache Spark


Apache Spark is a powerful, free, and open-source distributed computing framework designed for big data processing and analytics. It provides an interface for programming large-scale data processing tasks across clusters of computers.

Here’s a more detailed explanation of Apache Spark and its key features:

1. Distributed Computing: Apache Spark allows you to distribute data and computation across a cluster of machines, enabling parallel processing. It provides an abstraction called Resilient Distributed Datasets (RDDs), which are fault-tolerant collections of data that can be processed in parallel.

2. Speed and Performance: Spark is known for its speed and performance. It achieves this through in-memory computation, which allows data to be cached in memory, reducing the need for disk I/O. This enables faster data processing and iterative computations.

3. Scalability: Spark is highly scalable and can handle large datasets and complex computations. It automatically partitions and distributes data across a cluster, enabling efficient data processing and utilization of cluster resources.

4. Unified Analytics Engine: Spark provides a unified analytics engine that supports various data processing tasks, including batch processing, interactive queries, streaming data processing, and machine learning. This eliminates the need to use different tools or frameworks for different tasks, simplifying the development and deployment process.

5. Rich Ecosystem: Spark has a rich ecosystem with support for various programming languages such as Scala, Java, Python, and R. It also integrates well with other big data tools and frameworks like Hadoop, Hive, and Kafka. Additionally, Spark provides libraries for machine learning (Spark MLlib), graph processing (GraphX), and stream processing (Spark Streaming).

Here’s an example to illustrate the use of Apache Spark for data processing:

Suppose you have a large dataset containing customer information and sales transactions. You want to perform various data transformations, aggregations, and analysis on this data. With Apache Spark, you can:

1. Load the dataset into Spark as an RDD or DataFrame.

2. Apply transformations and operations like filtering, grouping, joining, and aggregating the data using Spark’s high-level APIs.

3. Utilize Spark’s in-memory processing capabilities for faster computation.

4. Perform complex analytics tasks such as calculating sales trends, customer segmentation, or recommendation systems using Spark’s machine learning library (MLlib).

5. Store the processed data or generate reports for further analysis or visualization.

By leveraging the distributed and parallel processing capabilities of Apache Spark, you can efficiently handle large datasets, process them in a scalable manner, and extract valuable insights from the data.

Overall, Apache Spark has gained popularity among data engineers and data scientists due to its ease of use, performance, scalability, and versatility in handling a wide range of big data processing tasks.

Here is an example code, we first create a SparkSession which is the entry point for working with Apache Spark. Then, we read data from a CSV file into a DataFrame. We apply transformations and actions on the data, such as filtering and grouping, and store the result in the result DataFrame. Finally, we display the result using the show() method and write it to a CSV file.

Remember to replace “path/to/input.csv” and “path/to/output.csv” with the actual paths to your input and output files.

from pyspark.sql import SparkSession


# Create a SparkSession
spark = SparkSession.builder.appName("SparkExample").getOrCreate()


# Read data from a CSV file into a DataFrame
data = spark.read.csv("path/to/input.csv", header=True, inferSchema=True)


# Perform some transformations and actions on the data
result = data.filter(data["age"] > 30).groupBy("gender").count()


# Show the result
result.show()


# Write the result to a CSV file
result.write.csv("path/to/output.csv")


# Stop the SparkSession
spark.stop()

Kafka and AI


Overview

Apache Kafka® is a hot technology amongst application developers and architects looking to build the latest generation of real-time and web-scale applications. According to the official Apache Kafka® website, “Kafka is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies.”

Why Use a Queuing or Streaming Engine?

Kafka is part of a general family of technologies known as queuing, messaging, or streaming engines. Other examples in this broad technology family include traditional message queue technologies such as RabbitMQ, IBM MQ, and Microsoft Message Queue. It can be said that Kafka is to traditional queuing technologies as NoSQL technology is to traditional relational databases: these newer technologies break through the scalability and performance limitations of the traditional solutions while meeting similar needs. Apache Kafka can also be compared to proprietary solutions offered by the big cloud providers, such as AWS Kinesis, Google Cloud Dataflow, and Azure Stream Analytics. You would typically introduce a technology from this family for reasons such as the following:

  • To smooth out temporary spikes in workload and increase reliability. That is, to deal gracefully with periods when messages arrive faster than the processing application can handle them, by quickly and safely storing the messages until the processing system catches up and can clear the backlog.
  • To increase flexibility in your application architecture by completely decoupling applications that produce events from the applications that consume them. This is particularly important for successfully implementing a microservices architecture, the current state of the art in application architectures. By using a queuing system, applications that produce events simply publish them to a named queue, and applications that are interested in the events consume them off the queue. The publisher and the consumer don’t need to know anything about each other except for the name of the queue and the message schema. There can be one or many producers publishing the same kind of message to the queue and one or many consumers reading the message, and neither side will care. For example, without changing the existing applications, you could:
  • add a new API application that accepts customer registrations from a new partner and posts them to the queue; or
  • add a new consumer application that registers the customer in a CRM system.

Why Use Kafka?

The objectives we’ve mentioned above can be achieved with a range of technologies. So why would you use Kafka rather than one of those other technologies for your use case?

  • It’s highly scalable
  • It’s highly reliable due to built in replication, supporting true always-on operations
  • It’s Apache Foundation open source with a strong community
  • It has built-in optimizations such as compression and message batching
  • It has a strong reputation for being used by leading organizations. For example: LinkedIn (originator), Pinterest, AirBnB, Datadog, Rabobank, Twitter, Netflix (see https://kafka.apache.org/powered-by for more)
  • It has a rich ecosystem around it including many connectors
  • It can serve as a distributed log store in a Kappa architecture
  • It can act as a stream processing engine (Kafka Streams)

Looking Under the Hood

Let’s take a look at how Kafka achieves all this: We’ll start with PRODUCERS. Producers are the applications that generate events and publish them to Kafka. Of course, they don’t randomly generate events; they create the events based on interactions with people, things, or systems. For example, a mobile app could generate an event when someone clicks on a button, an IoT device could generate an event when a reading occurs, or an API application could generate an event when called by another application (in fact, it is likely an API application would sit between a mobile app or IoT device and Kafka). These producer applications use a Kafka producer library (similar in concept to a database driver) to send events to Kafka, with libraries available for Java, C/C++, Python, Go, and .NET.
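
As a minimal sketch of a producer (using the kafka-python client as one possible library; the broker address, topic name and event contents are placeholders):

Python

import json
from kafka import KafkaProducer

# Placeholder broker address; events are serialized to JSON bytes
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

# An application-generated event, e.g. a button click from a mobile app
event = {"user_id": 42, "action": "button_click", "ts": "2024-01-01T12:00:00Z"}
producer.send("user-events", value=event)
producer.flush()  # block until the event has been delivered to the broker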

The next component to understand is the CONSUMERS. Consumers are applications that read the event from Kafka and perform some processing on them. Like producers, they can be written in various languages using the Kafka client libraries.

The core of the system is the Kafka BROKERS. When people talk about a Kafka cluster they are typically talking about the cluster of brokers. The brokers receive events from the producer and reliably store them so they can be read by consumers.

The brokers are configured with TOPICS. Topics are a bit like tables in a database, separating different types of data. Each topic is split into PARTITIONS. When an event is received, a record is appended to the log file for the topic and partition that the event belongs to (as determined by the metadata provided by the producer). Each of the partitions that make up a topic is allocated to the brokers in the cluster. This allows each broker to share the processing of a topic. When a topic is created, it can be configured to be replicated multiple times across the cluster so that the data is still available even if a server fails. For each partition, there is a single leader broker at any point in time that serves all reads and writes. The leader is responsible for synchronizing with the replicas. If the leader fails, Kafka will automatically transfer leader responsibility for its partitions to one of the replicas.
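
As a small sketch of how such a topic could be created from Python (using kafka-python’s admin client; the broker address and topic name are placeholders, and a replication factor of 3 assumes a cluster with at least three brokers):

Python

from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Three partitions spread the load across brokers; replication factor 3
# keeps a copy of every partition on three brokers so data survives a failure
admin.create_topics([
    NewTopic(name="user-events", num_partitions=3, replication_factor=3)
])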

In some instances, guaranteed ordering of message delivery is important so that events are consumed in the same order they are produced. Kafka supports this guarantee at the partition level: events in a partition are consumed in the order they were written. To facilitate this, consumer applications are placed in consumer groups, and within a CONSUMER GROUP each partition is assigned to only a single consumer instance.
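
A minimal consumer sketch (again using kafka-python; the broker address, topic and group name are placeholders). Starting several copies of this script with the same group_id forms one consumer group, and Kafka assigns each partition to exactly one of them:

Python

import json
from kafka import KafkaConsumer

# Every instance started with the same group_id joins the same consumer group
consumer = KafkaConsumer(
    "user-events",
    bootstrap_servers="localhost:9092",
    group_id="analytics-service",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    print(message.partition, message.offset, message.value)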

The following diagram illustrates all these Kafka concepts and their relationships:

A Kafka cluster is a complex distributed system with many configuration properties and possible interactions between components in the system. Operated well, Kafka can run at the highest levels of reliability even in relatively unreliable infrastructure environments such as the cloud.

A simple example is an AI/ML application that uses video streaming for a smart camera system, where the AI application detects people and objects for further analysis.

Here are the simple steps for how Kafka can be used in an AI/ML application where objects must be detected from streaming video images:

  1. Install Kafka. Kafka is an open-source distributed streaming platform. You can install it on your local machine or on a cloud platform.
  2. Configure Kafka. You need to configure Kafka to create a producer and a consumer. The producer will be responsible for sending the video streaming data to Kafka, and the consumer will be responsible for receiving the data and processing it.
  3. Create a producer. The producer will be responsible for sending the video streaming data to Kafka. You can use the Kafka producer API to send the data.
  4. Create a consumer. The consumer will be responsible for receiving the video streaming data from Kafka and processing it. You can use the Kafka consumer API to receive the data.
  5. Process the data. The consumer will need to process the video streaming data and detect objects or events of interest. You can use a machine learning model to do this.

Here is a diagram of the steps involved:

Video Streaming Data

Producer → Kafka → Consumer → Machine Learning Model

In more detail, here is a step-by-step guide to how Kafka can be used for an AI/ML application that involves video streaming and image detection:

1. Set up a Kafka cluster: Install and configure Apache Kafka on your infrastructure. A Kafka cluster typically consists of multiple brokers (servers) that handle message storage and distribution.

2. Define Kafka topics: Create Kafka topics to represent different stages of your AI/ML pipeline. For example, you might have topics like “raw_video_frames” to receive video frames and “processed_images” to store the results of image detection.

3. Video ingestion: Develop a video ingestion component that reads video streams or video files and extracts individual frames. This component should publish each frame as a message to the “raw_video_frames” topic in Kafka.

4. Implement image detection: Build an image detection module that takes each frame from the “raw_video_frames” topic, processes it using AI/ML algorithms, and generates a detection result. This component should consume messages from the “raw_video_frames” topic, perform image analysis, and produce the detected results.

5. Publish results: Once the image detection is complete, the results can be published to the “processed_images” topic in Kafka. Each message published to this topic would contain the corresponding image frame and its associated detection results.

6. Subscribers or consumers: Create subscriber applications that consume messages from the “processed_images” topic. These subscribers can be used for various purposes such as real-time visualization, storage, or triggering downstream actions based on the detection results.

7. Scalability and parallel processing: Kafka’s partitioning feature allows you to process video frames and image detection results in parallel. You can have multiple instances of the image detection module running in parallel, each consuming frames from a specific partition of the “raw_video_frames” topic. Similarly, subscribers of the “processed_images” topic can scale horizontally to handle the processing load efficiently.

8. Monitoring and management: Implement monitoring and management mechanisms to track the progress of your video processing pipeline. Kafka provides various tools and metrics to monitor the throughput, lag, and health of your Kafka cluster.

By following these steps, you can leverage Kafka’s distributed messaging system to build a scalable and efficient AI/ML application that can handle video streaming and image detection.
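
Putting steps 4 and 5 together, here is a minimal sketch of the image detection module. kafka-python is assumed, detect_objects is a placeholder for whatever model you actually use, and the topic names match the ones above:

Python

import json
from kafka import KafkaConsumer, KafkaProducer

def detect_objects(frame_bytes):
    """Placeholder for a real model (e.g. an object-detection network)."""
    return [{"label": "person", "confidence": 0.97}]

# Consume raw frames from the "raw_video_frames" topic as part of a consumer group
consumer = KafkaConsumer(
    "raw_video_frames",
    bootstrap_servers="localhost:9092",
    group_id="image-detection",
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda result: json.dumps(result).encode("utf-8"),
)

# Run detection on each frame and publish the results downstream
for message in consumer:
    detections = detect_objects(message.value)   # message.value is the raw frame bytes
    producer.send("processed_images", value={"frame_offset": message.offset,
                                             "detections": detections})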

You can find out more at https://kafka.apache.org/. Most of the information used here was collected from the Apache Kafka site and its documentation.

You can find many examples on the internet, including this one: https://www.researchgate.net/figure/Overview-of-Kafka-ML-architecture_fig3_353707721

Thank you.