
Posts

Showing posts with the label data pipeline

Fine-Tuning an LLM

Fine-tuning a language model for code completion tasks involves several steps, including data preprocessing, model training, and evaluation. Here is a detailed overview of the process, tailored for a hypothetical Phi-3 SLM (Small Language Model).

1. Data Preprocessing

Preprocessing is crucial to prepare the dataset for fine-tuning the model effectively. Here are the steps involved:

Data Collection
- Source Code Repositories: Gather data in various programming languages from platforms like GitHub, GitLab, Bitbucket, etc.
- Public Datasets: Use publicly available datasets such as CodeSearchNet or others provided by the AI community.

Cleaning and Formatting
- Remove Comments: Depending on the task, you might want to remove or keep comments. For code completion, retaining comments can help the model understand context.
- Normalize Code: Standardize code formatting, such as indentation, line breaks, and spacing.
- Remove Duplicates: Ensure there are...
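
As a minimal sketch of the cleaning and deduplication steps above (the directory path, the Python-only corpus, the comment-stripping rule, and the normalization choices are illustrative assumptions, not part of the original post):

```python
import hashlib
from pathlib import Path

def normalize(code: str) -> str:
    """Normalize line endings and strip trailing whitespace (illustrative choices)."""
    lines = [line.rstrip() for line in code.replace("\r\n", "\n").split("\n")]
    return "\n".join(lines).strip() + "\n"

def build_dataset(source_dir: str, keep_comments: bool = True) -> list[str]:
    """Collect, normalize, and deduplicate code files for fine-tuning."""
    seen_hashes = set()
    samples = []
    for path in Path(source_dir).rglob("*.py"):  # hypothetical: Python-only corpus
        code = path.read_text(encoding="utf-8", errors="ignore")
        if not keep_comments:
            # Crude comment removal; a real pipeline should use a parser/tokenizer
            code = "\n".join(l for l in code.split("\n") if not l.lstrip().startswith("#"))
        code = normalize(code)
        digest = hashlib.sha256(code.encode("utf-8")).hexdigest()
        if digest in seen_hashes:  # drop exact duplicates
            continue
        seen_hashes.add(digest)
        samples.append(code)
    return samples

if __name__ == "__main__":
    dataset = build_dataset("./repos", keep_comments=True)
    print(f"{len(dataset)} unique samples collected")
```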

Masking Data Before Ingest

Masking data before ingesting it into Azure Data Lake Storage (ADLS) Gen2, or any cloud-based data lake, involves transforming sensitive data elements into a protected format to prevent unauthorized access. Here's a high-level approach:

1. Identify Sensitive Data:
   - Determine which fields or data elements need to be masked, such as personally identifiable information (PII), financial data, or health records.
2. Choose a Masking Strategy:
   - Static Data Masking (SDM): Mask data at rest, before ingestion.
   - Dynamic Data Masking (DDM): Mask data in real time as it is being accessed.
3. Implement Masking Techniques:
   - Substitution: Replace sensitive data with fictitious but realistic data.
   - Shuffling: Randomly reorder data within a column.
   - Encryption: Encrypt sensitive data and decrypt it when needed.
   - Nulling Out: Replace sensitive data with null values.
   - Tokenization:...
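
A minimal sketch of static masking applied before upload to the lake, assuming a local staging CSV and hypothetical column names (email, ssn, salary); the hash-based substitution here is one of several possible masking techniques:

```python
import hashlib
import pandas as pd

def mask_value(value: str, salt: str = "static-salt") -> str:
    """Replace a sensitive value with a deterministic, irreversible token."""
    return hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()[:16]

# Hypothetical extract pulled from a source system before ingestion
df = pd.read_csv("customers.csv")

# Static masking applied before the data ever reaches the lake
df["email"] = df["email"].map(mask_value)   # substitute PII with tokens
df["ssn"] = df["ssn"].map(mask_value)
df["salary"] = None                         # nulling out a sensitive numeric field

# Write the masked file; this is what would then be uploaded to ADLS Gen2
df.to_csv("customers_masked.csv", index=False)
```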

LLM Deployment Pipeline with Azure and Kubeflow

Deploying a model, especially an LLM-based application, in Azure can be a daunting task when done manually. We can automate the deployment pipeline with Kubeflow. Below is an example of an end-to-end machine learning deployment pipeline using Kubeflow on Azure. It covers setting up a Kubeflow pipeline, training a model, and deploying the model.

Prerequisites:
1. Azure Account: You need an Azure account.
2. Azure Kubernetes Service (AKS): You need a Kubernetes cluster. You can create an AKS cluster via the Azure portal or CLI.
3. Kubeflow: You need Kubeflow installed on your AKS cluster. Follow the [Kubeflow on Azure documentation](https://www.kubeflow.org/docs/azure/aks/) to set this up.

Step 1: Setting Up the Environment

First, ensure you have the Azure CLI and kubectl installed and configured.

```sh
# Install Azure CLI
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash

# Install kubectl
az aks install-cli

# Log in to Azure
az login

# Set the subscription (if you have mu...
```
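
For the pipeline itself, here is a minimal sketch of what a definition might look like with the Kubeflow Pipelines SDK (kfp v2); the component bodies, names, base image, and artifact path are placeholders, not the post's own pipeline:

```python
# Minimal Kubeflow Pipelines (kfp v2) sketch; component logic and names are placeholders.
from kfp import dsl, compiler

@dsl.component(base_image="python:3.11")
def train_model(learning_rate: float) -> str:
    # Placeholder training step; a real component would fine-tune and persist the model.
    print(f"Training with learning rate {learning_rate}")
    return "models/llm-demo"  # hypothetical model artifact reference

@dsl.component(base_image="python:3.11")
def deploy_model(model_path: str):
    # Placeholder deployment step; a real component might roll the model out on AKS.
    print(f"Deploying model from {model_path}")

@dsl.pipeline(name="llm-train-and-deploy")
def llm_pipeline(learning_rate: float = 1e-5):
    train_task = train_model(learning_rate=learning_rate)
    deploy_model(model_path=train_task.output)

if __name__ == "__main__":
    # Compile to a YAML spec that can be uploaded to the Kubeflow UI running on AKS.
    compiler.Compiler().compile(llm_pipeline, "llm_pipeline.yaml")
```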

Some Questions and Topics for Data Engineers and Data Architects

How to do an incremental load in ADF?
Incremental loading in Azure Data Factory (ADF) involves loading only the data that has changed since the last load. This can be achieved by combining source-system change tracking mechanisms (like timestamps or change data capture) with lookup activities in ADF pipelines to identify new or updated data.

What is data profiling?
Data profiling is the process of analyzing and understanding the structure, content, quality, and relationships within a dataset. It involves examining statistics, patterns, and anomalies to gain insights into the data and ensure its suitability for specific use cases like reporting, analytics, or machine learning.

Difference between ETL and ELT?
ETL (Extract, Transform, Load) involves extracting data from source systems, transforming it into a suitable format, and then loading it into a target system. ELT (Extract, Load, Transform) involves loading raw data into a target system first, then transforming it ...
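
As a small illustration of the data profiling idea above, a sketch in pandas that computes basic structure and quality metrics; the file and its columns are hypothetical:

```python
import pandas as pd

# Hypothetical extract to profile before loading it into the warehouse
df = pd.read_csv("orders.csv")

profile = {
    "row_count": len(df),
    "column_types": df.dtypes.astype(str).to_dict(),
    "null_counts": df.isna().sum().to_dict(),
    "duplicate_rows": int(df.duplicated().sum()),
    "numeric_summary": df.describe().to_dict(),  # stats for numeric columns
}

for metric, value in profile.items():
    print(metric, value)
```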

Stream Processing Window Functions

Photo by João Jesus: Pexels

A common goal of stream processing is to aggregate events into temporal intervals, or windows. For example, to count the number of social media posts per minute or to calculate the average rainfall per hour.

Azure Stream Analytics includes native support for five kinds of temporal windowing functions. These functions enable you to define the temporal intervals into which data is aggregated in a query. The supported windowing functions are Tumbling, Hopping, Sliding, Session, and Snapshot.

These windowing functions are not exclusive to Azure Stream Analytics. They are common concepts in stream processing and are available in various stream processing frameworks and platforms beyond Azure, such as Apache Flink, Apache Kafka Streams, and Apache Spark Streaming. The syntax and implementation may vary slightly between platforms, but the underlying concepts remain the same.

Five different types of window functions: Tumbling Window (Azure St...
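
Since these concepts also exist in Spark, here is a minimal sketch of a tumbling-window count using Spark Structured Streaming; the built-in rate source and the one-minute window length are illustrative choices, not part of the original post:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("TumblingWindowExample").getOrCreate()

# Built-in test source that generates rows with "timestamp" and "value" columns
events = (
    spark.readStream
    .format("rate")
    .option("rowsPerSecond", 10)
    .load()
)

# Count events per non-overlapping (tumbling) 1-minute window of event time
counts = events.groupBy(window(col("timestamp"), "1 minute")).count()

query = (
    counts.writeStream
    .outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination(30)  # run briefly for demonstration
```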

Data Lake Comparison

AWS S3 (Simple Storage Service): Amazon Simple Storage Service (Amazon S3) is a scalable object storage service offered by Amazon Web Services (AWS). It provides developers and IT teams with secure, durable, and highly available storage infrastructure for a wide range of use cases, including data backup and recovery, data archiving, web and mobile applications, big data analytics, and content distribution.

Key Features:
1. Scalability: Amazon S3 is designed to scale seamlessly from a few gigabytes to petabytes or more of data without any upfront provisioning. It can handle virtually unlimited amounts of data and requests.
2. Durability and Availability: S3 stores data redundantly across multiple devices and facilities within a region to ensure high durability and availability. It offers 99.999999999% (11 nines) durability and a 99.99% availability SLA.
3. Security: S3 provides several security features to protect data at rest and in transit, including server-side encryption, encryption i...
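
As a small sketch of the server-side encryption feature mentioned above, an upload with SSE-S3 enabled via boto3; the bucket name, key, and payload are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Upload an object with server-side encryption (bucket and key are placeholders)
s3.put_object(
    Bucket="my-data-lake-bucket",
    Key="raw/events/2024/01/events.json",
    Body=b'{"event": "example"}',
    ServerSideEncryption="AES256",  # SSE with S3-managed keys
)
```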

Azure Data Factory Transform and Enrich Activity with Databricks and PySpark

In #azuredatafactory, the #transform and #enrich steps can be done automatically or written manually in #pyspark. Two examples are shown below: one with a #csv data source and another with #sqlserver using #incrementalloading.

Below is a simple end-to-end PySpark code example for a transform and enrich process in Azure Databricks. This example assumes you have a dataset stored in Azure Blob Storage and that you're using Azure Databricks for processing.

```python
# Import necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, concat

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("Transform and Enrich Process") \
    .getOrCreate()

# Read data from Azure Blob Storage
df = spark.read.csv("wasbs://<container_name>@<storage_account>.blob.core.windows.net/<file_path>", header=True)

# Perform transformations
transformed_df = df.withColumn("new_column", col("old_column") * 2)

# Enrich data
enriched...
```
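
The excerpt above is truncated before the second scenario, so here is a minimal sketch of what the SQL Server source with incremental loading might look like in PySpark; the JDBC URL, credentials, table, watermark value, column names, and output path are placeholders, not the post's own code:

```python
# Sketch of the second scenario (SQL Server source with incremental loading).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("Incremental Enrich From SQL Server").getOrCreate()

# Normally read from a control/watermark table maintained by the pipeline
last_watermark = "2024-01-01 00:00:00"

# Read only rows modified since the last load via a JDBC pushdown query
incremental_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>")
    .option("query", f"SELECT * FROM dbo.Sales WHERE ModifiedDate > '{last_watermark}'")
    .option("user", "<user>")
    .option("password", "<password>")
    .load()
)

# Enrich with a simple derived column, then append to the curated zone
enriched_df = incremental_df.withColumn("TotalWithTax", col("TotalDue") * 1.2)
enriched_df.write.mode("append").parquet(
    "abfss://curated@<storage_account>.dfs.core.windows.net/sales/"
)
```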

Handling Large Binary Data with Azure Synapse

Photo by Gül Işık

Handling large binary data in Azure Synapse

When dealing with large binary data types like geography or image data in Azure Synapse, you may encounter challenges due to limitations in supported data types and column sizes. Let's take the example of a City table with a Location column holding geography data, which needs to be converted to a varbinary type during loading, since Azure Synapse doesn't natively support geography types.

Example:
1. Convert to varbinary: During loading, convert the geography data to varbinary.
2. Data Chunking: Since PolyBase supports varbinary only up to 8000 bytes, data may get truncated. To overcome this, split the data into manageable chunks.
3. Temporary Staging: Create a temporary staging table for the Location column.
4. Chunk Processing: Split the location data into 8000-byte chunks for each city, resulting in 1 to N rows per city.
5. Reassembly: Reassemble the chunks using the T-SQL PIVOT operator to convert rows into colum...
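
The post performs the chunking and reassembly in T-SQL with PolyBase and PIVOT; as a language-neutral illustration of the same split/reassemble idea, here is a short sketch with a hypothetical city id and a fake binary payload:

```python
CHUNK_SIZE = 8000  # PolyBase varbinary limit in bytes

def split_into_chunks(city_id: int, blob: bytes, chunk_size: int = CHUNK_SIZE):
    """Yield (city_id, chunk_index, chunk) rows for a staging table."""
    for offset in range(0, len(blob), chunk_size):
        yield city_id, offset // chunk_size, blob[offset:offset + chunk_size]

def reassemble(rows):
    """Rebuild the original binary value from ordered (city_id, chunk_index, chunk) rows."""
    ordered = sorted(rows, key=lambda row: row[1])
    return b"".join(chunk for _, _, chunk in ordered)

# Round-trip check with a fake 20 KB geography payload for city 1
location_blob = bytes(range(256)) * 80  # 20,480 bytes of sample data
staged_rows = list(split_into_chunks(city_id=1, blob=location_blob))
assert reassemble(staged_rows) == location_blob
print(f"{len(staged_rows)} chunk rows staged for city 1")
```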

Incremental Data Loading from Databases for ETL

Pexels

Let's first discuss incremental loading into the data warehouse via ETL from different data sources, including databases.

Incremental Loading into Data Warehouses:

Incremental loading is crucial for efficiently updating data warehouses without reprocessing all data. It involves adding only the new or modified data since the last update. Key aspects include:
1. Efficiency: Incremental loading reduces processing time and resource usage by handling only the changes.
2. Change Detection: Techniques like timestamp comparison or change data capture (CDC) identify modified data.
3. Data Consistency: Ensure consistency by maintaining referential integrity during incremental updates.
4. Performance: Proper indexing, partitioning, and parallel processing enhance performance during incremental loads.
5. Logging and Auditing: Logging changes ensures traceability and facilitates error recovery in incremental loading processes.

Incremental Loading Explained

In contrast to a full load,...
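
To make the timestamp-comparison (high-watermark) approach above concrete, a minimal sketch using an in-memory SQLite source; the table, columns, and watermark value are hypothetical:

```python
# Timestamp-based (high-watermark) incremental extraction sketch.
import sqlite3

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "2024-01-01 08:00:00"),
     (2, 25.5, "2024-01-02 09:30:00"),
     (3, 40.0, "2024-01-03 11:15:00")],
)

# Watermark from the previous ETL run (normally persisted in a control table)
last_watermark = "2024-01-01 23:59:59"

# Extract only rows modified since the last load
changed_rows = source.execute(
    "SELECT id, amount, updated_at FROM orders WHERE updated_at > ? ORDER BY updated_at",
    (last_watermark,),
).fetchall()
print(f"{len(changed_rows)} new/updated rows to load")

# Advance the watermark to the latest timestamp just processed
if changed_rows:
    last_watermark = max(row[2] for row in changed_rows)
print("New watermark:", last_watermark)
```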