Data Masking When Ingesting Into Databricks

Photo by Alba Leader

Data masking is a data security technique that involves hiding data by changing its original numbers and letters. It's a way to create a fake version of data that's similar enough to the actual data, while still protecting it. This fake data can then be used as a functional alternative when the real data isn't needed.

Unity Catalog is not a feature within Databricks. Instead, Databricks provides the Delta Lake feature, which includes data governance capabilities such as row filters and column masking.

Unity Catalog in Databricks allows you to apply data governance policies such as row filters and column masks to sensitive data. Let’s break it down:

Row Filters:
Row filters enable you to apply a filter to a table so that subsequent queries only return rows for which the filter predicate evaluates to true.
To create a row filter, follow these steps:
Write a SQL user-defined function (UDF) to define the filter policy.
CREATE FUNCTION <function_name> (<parametergoog_1380099708_name> <parameter_type>, ...) RETURN {filter clause whobe a boolean};
Apply the row filter to an existing table using the following syntax:
ALTER TABLE <table_name> SET ROW FILTER <function_name> ON (<column_name>, ...);
You can also specify a row filter during the initial table creation.
Each table can have only one row filter, and it accepts input parameters that bind to specific columns of the table.
Column Masks:
Column masks allow you to transform or mask specific column values before returning them in query results.
To apply column masks:
Create a function that defines the masking logic.
Apply the masking function to a table column using an ALTER TABLE statement.
Alternatively, you can apply the masking function during table creation.
Unity Catalog Best Practices:
When setting up Unity Catalog, consider assigning a location to a catalog level. For example:
CREATE CATALOG hr_prod LOCATION 'abfss://mycompany-hr-prod@storage-account.dfs.core.windows.net/unity-catalog';

You can apply column masks to transform or conceal specific column values before returning them in query results. Here’s how you can achieve this:
Create a Masking Function:
Define a function that specifies the masking logic. This function will be used to transform the column values.
For example, let’s say you want to mask the last four digits of a credit card number. You can create a masking function that replaces the last four digits with asterisks.
Apply the Masking Function to a Column:
Use an ALTER TABLE statement to apply the masking function to a specific column.
For instance, if you have a column named credit_card_number, you can apply the masking function to it:
ALTER TABLE my_table SET COLUMN MASK credit_card_number USING my_masking_function;
Example Masking Function:
Suppose you want to mask the last four digits of a credit card number with asterisks. You can create a masking function like this:
CREATE FUNCTION my_masking_function AS BEGIN RETURN CONCAT('************', RIGHT(credit_card_number, 4)); END;
Query the Table:
When querying the table, the masked values will be returned instead of the original values.

Let’s focus on how you can achieve column masking in Databricks using Delta Lake:

Column Masking:
- Delta Lake allows you to apply column-level transformations or masks to sensitive data.
- You can create custom masking functions to modify specific column values before returning them in query results.
Creating a Masking Function:
- Define a user-defined function (UDF) that specifies the masking logic. For example, you can create a function that masks the last four digits of a credit card number.
- Here’s an example of a masking function that replaces the last four digits with asterisks:
```
def mask_credit_card(card_number):
    return "************" + card_number[-4:]
```

Applying the Masking Function:

Use the withColumn method to apply the masking function to a specific column in your DataFrame.

For instance, if you have a DataFrame named my_table with a column named credit_card_number, you can apply the masking function as follows:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Register the UDF
spark.udf.register("mask_credit_card", mask_credit_card, StringType())

# Apply the masking function to the column
masked_df = my_table.withColumn("masked_credit_card", udf("credit_card_number"))

Querying the Masked Data:
- When querying the masked_df, the transformed (masked) values will be returned for the masked_credit_card column.

You can find different related articles here kindly search.

Think Different

Search This Blog

Data Masking When Ingesting Into Databricks

Labels

Comments

Popular posts from this blog

Financial Engineering

Wholesale Customer Solution with Magento Commerce

How to Prepare for AI Driven Career