
Monday

Azure Data Factory, ADLS Gen2 Blob Storage and Syncing Data from a SharePoint Folder

 

Photo by Manuel Geissinger

Today we are going to discuss data sync between an on-premises SharePoint folder and Azure Blob Storage.

When we need to upload or download files between a SharePoint folder on the home network and Azure, we must also consider the best way to auto-sync them. Let's discuss this step by step.

Azure Data Factory (ADF) is a powerful cloud-based service provided by Microsoft Azure. Let me break it down for you:

  1. Purpose and Context:

    • In the world of big data, we often deal with raw, unorganized data stored in various systems.
    • However, raw data alone lacks the context and organization needed to yield meaningful insights.
    • Azure Data Factory (ADF) steps in to orchestrate and operationalize processes, transforming massive raw data into actionable business insights.
  2. What Does ADF Do?:

    • ADF is a managed cloud service designed for complex data integration projects.
    • It handles hybrid extract-transform-load (ETL) and extract-load-transform (ELT) scenarios.
    • It enables data movement and transformation at scale.
  3. Usage Scenarios:

    • Imagine a gaming company collecting petabytes of game logs from cloud-based games.
    • The company wants to:
      • Analyze these logs for customer insights.
      • Combine on-premises reference data with cloud log data.
      • Process the joined data using tools like Azure HDInsight (Spark cluster).
      • Publish transformed data to Azure Synapse Analytics for reporting.
    • ADF automates this workflow, allowing daily scheduling and execution triggered by file arrivals in a blob store container.
  4. Key Features:

    • Data-Driven Workflows: Create and schedule data-driven workflows (called pipelines).
    • Ingestion: Ingest data from disparate data stores.
    • Transformation: Build complex ETL processes using visual data flows or compute services like Azure HDInsight Hadoop, Azure Databricks, and Azure SQL Database.
    • Publishing: Publish transformed data to destinations like Azure Synapse Analytics for business intelligence applications.
  5. Why ADF Matters:

    • It bridges the gap between raw data and actionable insights.
    • Businesses can make informed decisions based on unified data insights.

Learn more about Azure Data Factory on Microsoft Learn.
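
If you prefer to automate pipeline runs from code rather than the portal, the sketch below shows one way to trigger and monitor an ADF pipeline with the azure-identity and azure-mgmt-datafactory Python packages. The subscription, resource group, factory, and pipeline names are placeholders, and it assumes a recent SDK version that accepts azure-identity credentials.

    # Rough sketch: trigger and monitor an ADF pipeline run from Python.
    # Assumes azure-identity and a recent azure-mgmt-datafactory; all names are placeholders.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient

    credential = DefaultAzureCredential()
    adf_client = DataFactoryManagementClient(credential, "<subscription-id>")

    # Start the pipeline and capture the run id
    run = adf_client.pipelines.create_run(
        "my-resource-group", "my-data-factory", "CopySharePointToBlob", parameters={}
    )

    # Check the status of that run
    pipeline_run = adf_client.pipeline_runs.get(
        "my-resource-group", "my-data-factory", run.run_id
    )
    print(pipeline_run.status)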

Azure Data Factory (ADF) can indeed sync data between on-premises SharePoint folders and Azure Blob Storage. Let’s break it down:

  1. Syncing with On-Premises SharePoint Folder:

    • ADF allows you to copy data from a SharePoint Online List to various supported data stores.
    • Here’s how you can set it up:
      • Prerequisites:
        • Register an application with the Microsoft identity platform.
        • Note down the Application ID, Application key, and Tenant ID.
        • Grant your registered application permission in your SharePoint Online site.
      • Configuration:
        • Create a SharePoint linked service in ADF using the Application ID, Application key, and Tenant ID noted above, and define a source dataset on top of it.
  2. Syncing with Azure Blob Storage:

    • Create an Azure Blob Storage linked service (authenticated with an account key, SAS token, or managed identity) and define a sink dataset that points to the target container.
  3. Combining Both:

    • To sync data between an on-premises SharePoint folder and Azure Blob Storage:
      • Set up your SharePoint linked service.
      • Set up your Azure Blob Storage linked service.
      • Create a pipeline that uses the Copy activity to move data from SharePoint to Blob Storage.
      • Optionally, apply any necessary transformations using the Data Flow activity.

Remember, ADF is your orchestration tool, ensuring seamless data movement and transformation across various data sources and sinks.
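
As a side note, if you want to sanity check the Blob Storage sink independently of ADF, a quick upload test with the azure-storage-blob package looks roughly like this (the connection string, container, and file names are placeholders):

    # Rough sketch: upload a locally staged file to Blob Storage with azure-storage-blob.
    # The connection string, container name, and paths below are placeholders.
    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient.from_connection_string("<storage-connection-string>")
    blob = service.get_blob_client(container="sharepoint-sync", blob="reports/report.xlsx")

    with open("report.xlsx", "rb") as data:
        blob.upload_blob(data, overwrite=True)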

On the other hand, Azure Data Lake Storage Gen2 (ADLS Gen2) is a powerful service in the Microsoft Azure ecosystem. Let’s explore how to use it effectively:

  1. Overview of ADLS Gen2:

    • ADLS Gen2 combines the capabilities of a data lake with the scalability and performance of Azure Blob Storage.
    • It’s designed for handling large volumes of diverse data, making it ideal for big data analytics and data warehousing scenarios.
  2. Best Practices for Using ADLS Gen2:

    • Optimize Performance:
      • Consider using a premium block blob storage account if your workloads require low latency and a high number of I/O operations per second (IOPS).
      • Premium accounts store data on solid-state drives (SSDs) optimized for low latency and high throughput.
      • While storage costs are higher, transaction costs are lower.
    • Reduce Costs:
      • Organize your data into data sets within ADLS Gen2.
      • Provision separate ADLS Gen2 accounts for different data landing zones.
      • Evaluate feature support and known issues to make informed decisions.
    • Security and Compliance:
      • Use service principals or access keys to access ADLS Gen2.
      • Understand terminology differences (e.g., blobs vs. files).
      • Review the documentation for feature-specific guidance.
    • Integration with Other Services:
      • Mount ADLS Gen2 to Azure Databricks for reading and writing data.
      • Compare ADLS Gen2 with Azure Blob Storage for different use cases.
      • Understand where ADLS Gen2 fits in the stages of analytical processing.
  3. Accessing ADLS Gen2:

    • You can access ADLS Gen2 in three ways:
      • Mounting it to Azure Databricks using a service principal or OAuth 2.0 (see the sketch after this list).
      • Directly using a service principal.
      • Using the ADLS Gen2 storage account access key directly.
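
To illustrate the first option, mounting an ADLS Gen2 container from a Databricks notebook with a service principal typically looks like the sketch below. The application ID, secret scope, tenant ID, container, and storage account names are all placeholders, and dbutils is only available inside Databricks.

    # Rough sketch: mount an ADLS Gen2 container in Azure Databricks using a service principal.
    # All IDs, secret scope/key names, and account/container names are placeholders.
    configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": "<application-id>",
        "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope>", key="<service-credential-key>"),
        "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
    }

    dbutils.fs.mount(
        source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
        mount_point="/mnt/adls",
        extra_configs=configs,
    )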

Remember, ADLS Gen2 empowers you to manage and analyze vast amounts of data efficiently. Dive into the documentation and explore its capabilities! 

Learn more about Azure Data Lake Storage Gen2 on Microsoft Learn.

Let’s set up a data flow that automatically copies files from an on-premises SharePoint folder to Azure Data Lake Storage Gen2 (ADLS Gen2) whenever new files are uploaded. Here are the steps:

  1. Prerequisites:

    • Ensure you have the following:
      • An Azure subscription (create one if needed).
      • An Azure Storage account with ADLS Gen2 enabled.
      • An on-premises SharePoint folder containing the files you want to sync.
  2. Create an Azure Data Factory (ADF):

    • If you haven’t already, create an Azure Data Factory using the Azure portal.
    • Launch the Data Integration application in ADF.
  3. Set Up the Copy Data Tool:

    • In the ADF home page, select the Ingest tile to launch the Copy Data tool.
    • Configure the properties:
      • Choose Built-in copy task under Task type.
      • Select Run once now under Task cadence or task schedule.
  4. Configure the Source (SharePoint):

    • Click + New connection.
    • Select SharePoint from the connector gallery.
    • Provide the necessary credentials and details for your on-premises SharePoint folder.
    • Define the source dataset.
  5. Configure the Destination (ADLS Gen2):

    • Click + New connection.
    • Select Azure Data Lake Storage Gen2 from the connector gallery.
    • Choose your ADLS Gen2-enabled storage account from the “Storage account name” drop-down list.
    • Create the connection.
  6. Mapping and Transformation (Optional):

    • If needed, apply any transformations or mappings between the source and destination.
    • You can use the Data Flow activity for more complex transformations.
  7. Run the Pipeline:

    • Save your configuration.
    • Execute the pipeline to copy data from SharePoint to ADLS Gen2.
    • You can schedule this pipeline to run periodically or trigger it based on events (e.g., new files in SharePoint).
  8. Monitoring and Alerts:

    • Monitor the pipeline execution in the Azure portal.
    • Set up alerts for any failures or anomalies.

Remember to adjust the settings according to your specific SharePoint folder and ADLS Gen2 requirements. With this setup, your files will be automatically synced from SharePoint to ADLS Gen2 whenever new files are uploaded! 

Learn more about loading data into Azure Data Lake Storage Gen2 on Microsoft Learn.

Tuesday

Data Masking When Ingesting Into Databricks

 

Photo by Alba Leader

Data masking is a data security technique that involves hiding data by changing its original numbers and letters. It's a way to create a fake version of data that's similar enough to the actual data, while still protecting it. This fake data can then be used as a functional alternative when the real data isn't needed. 



Unity Catalog is Databricks’ unified data governance solution, and it allows you to apply governance policies such as row filters and column masks to sensitive data. Let’s break it down:

  1. Row Filters:

    • Row filters enable you to apply a filter to a table so that subsequent queries only return rows for which the filter predicate evaluates to true.
    • To create a row filter, follow these steps:
      1. Write a SQL user-defined function (UDF) to define the filter policy:
        CREATE FUNCTION <function_name> (<parameter_name> <parameter_type>, ...) RETURN {filter clause that evaluates to a boolean};
      2. Apply the row filter to an existing table using the following syntax:
        ALTER TABLE <table_name> SET ROW FILTER <function_name> ON (<column_name>, ...);
      3. You can also specify a row filter during the initial table creation.
    • Each table can have only one row filter, and it accepts input parameters that bind to specific columns of the table.
  2. Column Masks:

    • Column masks allow you to transform or mask specific column values before returning them in query results.
    • To apply column masks:
      1. Create a function that defines the masking logic.
      2. Apply the masking function to a table column using an ALTER TABLE statement.
      3. Alternatively, you can apply the masking function during table creation.
  3. Unity Catalog Best Practices:

    • When setting up Unity Catalog, consider assigning a storage location at the catalog level. For example:
      CREATE CATALOG hr_prod
      LOCATION 'abfss://mycompany-hr-prod@storage-account.dfs.core.windows.net/unity-catalog';

You can apply column masks to transform or conceal specific column values before returning them in query results. Here’s how you can achieve this:

  1. Create a Masking Function:

    • Define a function that specifies the masking logic. This function will be used to transform the column values.
    • For example, let’s say you want to mask a credit card number so that only its last four digits remain visible. You can create a masking function that replaces all but the last four digits with asterisks.
  2. Apply the Masking Function to a Column:

    • Use an ALTER TABLE statement to apply the masking function to a specific column.
    • For instance, if you have a column named credit_card_number, you can apply the masking function to it:
      ALTER TABLE my_table ALTER COLUMN credit_card_number SET MASK my_masking_function;
      
  3. Example Masking Function:

    • Suppose you want to replace all but the last four digits of a credit card number with asterisks. In Databricks SQL, you can create a masking function like this:
      CREATE FUNCTION my_masking_function(credit_card_number STRING)
      RETURNS STRING
      RETURN CONCAT('************', RIGHT(credit_card_number, 4));
      
  4. Query the Table:

    • When querying the table, the masked values will be returned instead of the original values.
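
Putting these pieces together, a minimal sketch from a Databricks notebook might look like this. It assumes a Unity Catalog-enabled workspace and an existing table named my_table with a STRING column credit_card_number (both placeholder names):

    # Minimal sketch, assuming a Unity Catalog-enabled workspace and a placeholder table my_table.
    spark.sql("""
        CREATE OR REPLACE FUNCTION my_masking_function(credit_card_number STRING)
        RETURNS STRING
        RETURN CONCAT('************', RIGHT(credit_card_number, 4))
    """)

    # Attach the mask to the column
    spark.sql("ALTER TABLE my_table ALTER COLUMN credit_card_number SET MASK my_masking_function")

    # Subsequent queries return masked values for users who are not exempt from the policy
    spark.sql("SELECT credit_card_number FROM my_table").show()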

Let’s focus on how you can achieve column masking in Databricks at ingestion time, using PySpark before the data lands in Delta Lake:

  1. Column Masking:

    • When ingesting data into Delta Lake, you can apply column-level transformations or masks to sensitive data using standard Spark operations.
    • You can create custom masking functions to modify specific column values before returning them in query results.
  2. Creating a Masking Function:

    • Define a user-defined function (UDF) that specifies the masking logic. For example, you can create a function that masks the last four digits of a credit card number.
    • Here’s an example of a masking function that replaces the last four digits with asterisks:
      def mask_credit_card(card_number):
          return "************" + card_number[-4:]
      
  3. Applying the Masking Function:

    • Use the withColumn method to apply the masking function to a specific column in your DataFrame.
    • For instance, if you have a DataFrame named my_table with a column named credit_card_number, you can apply the masking function as follows:
      from pyspark.sql.functions import udf, col
      from pyspark.sql.types import StringType
      
      # Wrap the Python function as a Spark UDF
      mask_credit_card_udf = udf(mask_credit_card, StringType())
      
      # Apply the masking UDF to the column
      masked_df = my_table.withColumn("masked_credit_card", mask_credit_card_udf(col("credit_card_number")))
      
  4. Querying the Masked Data:

    • When querying the masked_df, the transformed (masked) values will be returned for the masked_credit_card column.

You can find other related articles on this blog; kindly use the search option.


Sunday

AI Integration

Following are some questions regarding Python and AI integration. 

1. What is AI integration in the context of cloud computing?

Answer: AI integration in cloud computing refers to the seamless incorporation of Artificial Intelligence services, frameworks, or models into cloud platforms. It allows users to leverage AI capabilities without managing the underlying infrastructure.

2. How can Python be used for AI integration in the cloud?

Answer: Python is widely used for AI integration in the cloud due to its extensive libraries and frameworks. Tools like TensorFlow, PyTorch, and scikit-learn are compatible with cloud platforms, enabling developers to deploy and scale AI models efficiently.

Additionally, Python models can be exposed through web frameworks such as FastAPI and Flask, or through serverless functions such as AWS Lambda and Azure Functions.
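
For example, a minimal sketch of serving a pre-trained scikit-learn model behind a FastAPI endpoint could look like this (the model file name and feature layout are placeholder assumptions):

    # Minimal sketch: serve a pre-trained scikit-learn model over HTTP with FastAPI.
    # "model.joblib" and the feature layout are placeholders, not a real model.
    from typing import List

    import joblib
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()
    model = joblib.load("model.joblib")  # assumed to be trained and saved beforehand

    class Features(BaseModel):
        values: List[float]

    @app.post("/predict")
    def predict(features: Features):
        prediction = model.predict([features.values])
        return {"prediction": prediction.tolist()}

In practice you would run this with an ASGI server such as uvicorn and containerize it for deployment to a cloud service.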

3. What are the benefits of integrating AI with cloud services?

Answer: Integrating AI with cloud services offers scalability, cost-effectiveness, and accessibility. It allows businesses to leverage powerful AI capabilities without investing heavily in infrastructure, facilitating easy deployment, and enabling global accessibility.

4. Explain the role of cloud-based AI services like AWS SageMaker or Azure Machine Learning in Python.

Answer: Cloud-based AI services provide managed environments for building, training, and deploying machine learning models. In Python, libraries like Boto3 (for AWS) or Azure SDK facilitate interaction with these services, allowing seamless integration with Python-based AI workflows.
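
As a hedged illustration, calling an already-deployed SageMaker endpoint from Python with Boto3 usually looks like this (the region, endpoint name, and CSV payload are placeholders):

    # Rough sketch: invoke a deployed Amazon SageMaker endpoint with Boto3.
    # The region, endpoint name, and payload format are placeholders for an existing deployment.
    import boto3

    runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")

    response = runtime.invoke_endpoint(
        EndpointName="my-endpoint",
        ContentType="text/csv",
        Body="5.1,3.5,1.4,0.2",
    )
    print(response["Body"].read().decode("utf-8"))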

5. How can you handle large-scale AI workloads in the cloud using Python?

Answer: Python's parallel processing capabilities and cloud-based services like AWS Lambda or Google Cloud Functions can be used to distribute and scale AI workloads. Additionally, containerization tools like Docker and Kubernetes enhance portability and scalability.

6. Discuss considerations for security and compliance when integrating AI with cloud platforms in Python.

Answer: Security measures such as encryption, access controls, and secure APIs are crucial. Compliance with data protection regulations must be ensured. Python libraries like cryptography and secure cloud configurations play a role in implementing robust security practices.

7. How do you optimize costs while integrating AI solutions into cloud environments using Python?

Answer: Implement cost optimization strategies such as serverless computing, auto-scaling, and resource-efficient algorithms. Cloud providers offer pricing models that align with usage, and Python scripts can be optimized for efficient resource utilization.

8. Can you provide examples of Python libraries/frameworks used for AI integration with cloud platforms?

Answer: TensorFlow, PyTorch, and scikit-learn are popular Python libraries for AI. For cloud integration, Boto3 (AWS), Azure SDK (Azure), and google-cloud-python (Google Cloud) are widely used.

9. Describe a scenario where serverless computing in the cloud is beneficial for AI integration using Python.

 Answer: Serverless computing is beneficial when dealing with sporadic AI workloads. For instance, using AWS Lambda functions triggered by specific events to execute Python scripts for processing images or analyzing data.
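
A sketch of such an event-triggered function is shown below; the bucket/key parsing follows the standard S3 event shape, while the actual processing step is left as a placeholder:

    # Rough sketch: an AWS Lambda handler triggered by an S3 "object created" event.
    # The processing step is a placeholder; real handlers depend on the configured trigger.
    import json

    def lambda_handler(event, context):
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            # Placeholder: download the object and run your Python analysis or inference here
            print(f"New object to process: s3://{bucket}/{key}")
        return {"statusCode": 200, "body": json.dumps("processed")}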

10. How can you ensure data privacy when deploying AI models on cloud platforms with Python?

Answer: Use encryption for data in transit and at rest. Implement access controls and comply with data protection regulations. Python libraries like PyCryptodome can be utilized for encryption tasks.
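
As a small illustration of the encryption point, PyCryptodome can encrypt a payload with AES-GCM before it leaves your environment. This is only a sketch; in practice the key would come from a managed service such as AWS KMS or Azure Key Vault:

    # Minimal sketch: encrypt and decrypt a payload with PyCryptodome (AES-GCM).
    # Key management is a placeholder; use a cloud KMS/Key Vault in real deployments.
    from Crypto.Cipher import AES
    from Crypto.Random import get_random_bytes

    key = get_random_bytes(32)  # 256-bit key
    cipher = AES.new(key, AES.MODE_GCM)
    ciphertext, tag = cipher.encrypt_and_digest(b"sensitive training record")

    # The nonce, tag, and ciphertext are all needed to decrypt and verify later
    decryptor = AES.new(key, AES.MODE_GCM, nonce=cipher.nonce)
    plaintext = decryptor.decrypt_and_verify(ciphertext, tag)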



Thursday

GPU with TensorFlow

 


You might have used a GPU to speed up your machine learning code with PyTorch. However, did you know that you can do the same with TensorFlow?

Here are the steps on how to enable GPU acceleration for TensorFlow to achieve faster performance:

1. Verify GPU Compatibility:

  • Check for CUDA Support: Ensure your GPU has a compute capability of 3.5 or higher (check NVIDIA's website).
  • Install CUDA Toolkit and cuDNN: Download and install the appropriate CUDA Toolkit and cuDNN versions compatible with your TensorFlow version and GPU from NVIDIA's website.

2. Install GPU-Enabled TensorFlow:

  • Use pip: For current TensorFlow releases (2.x), the standard package already includes GPU support, so a fresh install is simply:
    Bash
    pip install tensorflow
    
  • Upgrade Existing Installation: If you already have TensorFlow installed, upgrade it in place:
    Bash
    pip install --upgrade tensorflow
    
  • Note: The separate tensorflow-gpu package was only needed for old 1.x releases and has since been deprecated.

3. Verify GPU Detection:

  • Run a TensorFlow script: Create a simple TensorFlow script and run it. If it detects your GPU, you'll see a message like "Found GPU at: /device:GPU:0".
  • Check in Python: You can also check within Python:
    Python
    import tensorflow as tf
    print(tf.config.list_physical_devices('GPU'))
    

4. Place Operations on GPU:

  • Manual Placement: Use with tf.device('/GPU:0') to place operations on the GPU:
    Python
    with tf.device('/GPU:0'):
        # Example: run a matrix multiplication on the GPU
        a = tf.random.normal([1000, 1000])
        c = tf.matmul(a, a)
    
  • Automatic Placement: TensorFlow often places operations on the GPU automatically if available.

5. Monitor GPU Usage:

  • Tools: Use tools like NVIDIA System Management Interface (nvidia-smi) or TensorFlow's profiling tools to monitor GPU usage and memory during training.

Additional Tips:

  • TensorFlow Version: Ensure your TensorFlow version is compatible with your CUDA and cuDNN versions.
  • Multiple GPUs: If you have multiple GPUs, you can control which ones TensorFlow sees with tf.config.set_visible_devices() and distribute training across them with tf.distribute.MirroredStrategy().
  • Performance Optimization: Explore techniques like mixed precision training and XLA compilation for further performance gains (see the sketch below).
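
Here is a small sketch of those two optimizations, assuming TensorFlow 2.5 or later for the jit_compile flag:

    Python
    # Sketch: enable mixed precision and XLA compilation (TensorFlow 2.5+ API names).
    import tensorflow as tf
    from tensorflow.keras import mixed_precision

    # Mixed precision: compute in float16 where safe, keep variables in float32
    mixed_precision.set_global_policy("mixed_float16")

    # XLA: ask TensorFlow to JIT-compile this function
    @tf.function(jit_compile=True)
    def scaled_matmul(x, y):
        return tf.matmul(x, y) * 2.0

    result = scaled_matmul(tf.random.normal([512, 512]), tf.random.normal([512, 512]))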

Remember:

  • Consult TensorFlow's documentation for the most up-to-date instructions and troubleshooting tips. https://www.tensorflow.org/guide/gpu
  • GPU acceleration can significantly improve performance, especially for large models and datasets.

AI Assistant For Test Assignment

Photo by Google DeepMind

Creating an AI application to assist school teachers with testing assignments and result analysis can greatly ben...