Python library which makes it possible to dynamically mask/anonymize data using JSON string or python dict rules in a PySpark environment.

Overview

pyspark-anonymizer

Python library which makes it possible to dynamically mask/anonymize data using JSON string or python dict rules in a PySpark environment.

Installing

pip install pyspark-anonymizer

Usage

Before Masking

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("your_app_name").getOrCreate()
df = spark.read.parquet("s3://amazon-reviews-pds/parquet/product_category=Electronics/")
df.limit(5).toPandas()
marketplace customer_id review_id product_id product_parent product_title star_rating helpful_votes total_votes vine verified_purchase review_headline review_body review_date year
0 US 51163966 R2RX7KLOQQ5VBG B00000JBAT 738692522 Diamond Rio Digital Player 3 0 0 N N Why just 30 minutes? RIO is really great, but Diamond should increa... 1999-06-22 1999
1 US 30050581 RPHMRNCGZF2HN B001BRPLZU 197287809 NG 283220 AC Adapter Power Supply for HP Pavil... 5 0 0 N Y Five Stars Great quality for the price!!!! 2014-11-17 2014
2 US 52246039 R3PD79H9CTER8U B00000JBAT 738692522 Diamond Rio Digital Player 5 1 2 N N The digital audio "killer app" One of several first-generation portable MP3 p... 1999-06-30 1999
3 US 16186332 R3U6UVNH7HGDMS B009CY43DK 856142222 HDE Mini Portable Capsule Travel Mobile Pocket... 5 0 0 N Y Five Stars I like it, got some for the Grandchilren 2014-11-17 2014
4 US 53068431 R3SP31LN235GV3 B00000JBSN 670078724 JVC FS-7000 Executive MicroSystem (Discontinue... 3 5 5 N N Design flaws ruined the better functions I returned mine for a couple of reasons: The ... 1999-07-13 1999

After Masking

In this example we will add the following data anonymizers:

  • drop_column on column "marketplace"
  • replace all values to "*" of the "customer_id" column
  • replace_with_regex "R\d" (R and any digit) to "*" on "review_id" column
  • sha256 on "product_id" column
  • filter_row with condition "product_parent != 738692522"
from pyspark.sql import SparkSession
import pyspark.sql.functions as spark_functions
import pyspark_anonymizer

spark = SparkSession.builder.appName("your_app_name").getOrCreate()
df = spark.read.parquet("s3://amazon-reviews-pds/parquet/product_category=Electronics/")

dataframe_anonymizers = [
    {
        "method": "drop_column",
        "parameters": {
            "column_name": "marketplace"
        }
    },
    {
        "method": "replace",
        "parameters": {
            "column_name": "customer_id",
            "replace_to": "*"
        }
    },
    {
        "method": "replace_with_regex",
        "parameters": {
            "column_name": "review_id",
            "replace_from_regex": "R\d",
            "replace_to": "*"
        }
    },
    {
        "method": "sha256",
        "parameters": {
            "column_name": "product_id"
        }
    },
    {
        "method": "filter_row",
        "parameters": {
            "where": "product_parent != 738692522"
        }
    }
]

df_parsed = pyspark_anonymizer.Parser(df, dataframe_anonymizers, spark_functions).parse()
df_parsed.limit(5).toPandas()
customer_id review_id product_id product_parent product_title star_rating helpful_votes total_votes vine verified_purchase review_headline review_body review_date year
0 * RPHMRNCGZF2HN 69031b13080f90ae3bbbb505f5f80716cd11c4eadd8d86... 197287809 NG 283220 AC Adapter Power Supply for HP Pavil... 5 0 0 N Y Five Stars Great quality for the price!!!! 2014-11-17 2014
1 * *U6UVNH7HGDMS c99947c06f65c1398b39d092b50903986854c21fd1aeab... 856142222 HDE Mini Portable Capsule Travel Mobile Pocket... 5 0 0 N Y Five Stars I like it, got some for the Grandchilren 2014-11-17 2014
2 * *SP31LN235GV3 eb6b489524a2fb1d2de5d2e869d600ee2663e952a4b252... 670078724 JVC FS-7000 Executive MicroSystem (Discontinue... 3 5 5 N N Design flaws ruined the better functions I returned mine for a couple of reasons: The ... 1999-07-13 1999
3 * *IYAZPPTRJF7E 2a243d31915e78f260db520d9dcb9b16725191f55c54df... 503838146 BlueRigger High Speed HDMI Cable with Ethernet... 3 0 0 N Y Never got around to returning the 1 out of 2 ... Never got around to returning the 1 out of 2 t... 2014-11-17 2014
4 * *RDD9FILG1LSN c1f5e54677bf48936fb1e9838869630e934d16ac653b15... 587294791 Brookstone 2.4GHz Wireless TV Headphones 5 3 3 N Y Saved my. marriage, I swear to god. Saved my.marriage, I swear to god. 2014-11-17 2014

Anonymizers from DynamoDB

You can store anonymizers on DynamoDB too.

Creating DynamoDB table

To create the table follow the steps below.

Using example script

On AWS console:

  • DynamoDB > Tables > Create table
  • Table name: "pyspark_anonymizer" (or any other of your own)
  • Partition key: "dataframe_name"
  • Customize the settings if you want
  • Create table

Writing Anonymizer on DynamoDB

You can run the example script, then edit your settings from there.

Parse from DynamoDB

from pyspark.sql import SparkSession
import pyspark.sql.functions as spark_functions
import pyspark_anonymizer
import boto3
from botocore.exceptions import ClientError as client_error

dynamo_table = "pyspark_anonymizer"
dataframe_name = "table_x"

dynamo_table = boto3.resource('dynamodb').Table(dynamo_table)
spark = SparkSession.builder.appName("your_app_name").getOrCreate()
df = spark.read.parquet("s3://amazon-reviews-pds/parquet/product_category=Electronics/")

df_parsed = pyspark_anonymizer.ParserFromDynamoDB(df, dataframe_name, dynamo_table, spark_functions, client_error).parse()

df_parsed.limit(5).toPandas()

The output will be same as the previous. The difference is that the anonymization settings will be in DynamoDB

Currently supported data masking/anonymization methods

  • Methods
    • drop_column - Drop a column.
    • replace - Replace all column to a string.
    • replace_with_regex - Replace column contents with regex.
    • sha256 - Apply sha256 hashing function.
    • filter_row - Apply a filter to the dataframe.
2D fluid simulation implementation of Jos Stam paper on real-time fuild dynamics, including some suggested extensions.

Fluid Simulation Usage Download this repo and store it in your computer. Open a terminal and go to the root directory of this folder. Make sure you ha

Mariana Ávalos Arce 5 Dec 02, 2022
Machine Learning from Scratch

Machine Learning from Scratch Author: Shengxuan Wang From: Oregon State University Content: Building Machine Learning model from Scratch, without usin

ShawnWang 0 Jul 05, 2022
Neighbourhood Retrieval (Nearest Neighbours) with Distance Correlation.

Neighbourhood Retrieval with Distance Correlation Assign Pseudo class labels to datapoints in the latent space. NNDC is a slim wrapper around FAISS. N

The Learning Machines 1 Jan 16, 2022
CyLP is a Python interface to COIN-OR’s Linear and mixed-integer program solvers (CLP, CBC, and CGL)

CyLP CyLP is a Python interface to COIN-OR’s Linear and mixed-integer program solvers (CLP, CBC, and CGL). CyLP’s unique feature is that you can use i

COIN-OR Foundation 161 Dec 14, 2022
We have a dataset of user performances. The project is to develop a machine learning model that will predict the salaries of baseball players.

Salary-Prediction-with-Machine-Learning 1. Business Problem Can a machine learning project be implemented to estimate the salaries of baseball players

Ayşe Nur Türkaslan 9 Oct 14, 2022
This is the code repository for LRM Stochastic watershed model.

LRM-Squannacook Input data for generating stochastic streamflows are observed and simulated timeseries of streamflow. their format needs to be CSV wit

1 Feb 14, 2022
This is an implementation of the proximal policy optimization algorithm for the C++ API of Pytorch

This is an implementation of the proximal policy optimization algorithm for the C++ API of Pytorch. It uses a simple TestEnvironment to test the algorithm

Martin Huber 59 Dec 09, 2022
Distributed Evolutionary Algorithms in Python

DEAP DEAP is a novel evolutionary computation framework for rapid prototyping and testing of ideas. It seeks to make algorithms explicit and data stru

Distributed Evolutionary Algorithms in Python 4.9k Jan 05, 2023
A machine learning toolkit dedicated to time-series data

tslearn The machine learning toolkit for time series analysis in Python Section Description Installation Installing the dependencies and tslearn Getti

2.3k Jan 05, 2023
A Python step-by-step primer for Machine Learning and Optimization

early-ML Presentation General Machine Learning tutorials A Python step-by-step primer for Machine Learning and Optimization This github repository gat

Dimitri Bettebghor 8 Dec 01, 2022
Mortality risk prediction for COVID-19 patients using XGBoost models

Mortality risk prediction for COVID-19 patients using XGBoost models Using demographic and lab test data received from the HM Hospitales in Spain, I b

1 Jan 19, 2022
2021 Machine Learning Security Evasion Competition

2021 Machine Learning Security Evasion Competition This repository contains code samples for the 2021 Machine Learning Security Evasion Competition. P

Fabrício Ceschin 8 May 01, 2022
Titanic Traveller Survivability Prediction

The aim of the mini project is predict whether or not a passenger survived based on attributes such as their age, sex, passenger class, where they embarked and more.

John Phillip 0 Jan 20, 2022
A Tools that help Data Scientists and ML engineers train and deploy ML models.

Domino Research This repo contains projects under active development by the Domino R&D team. We build tools that help Data Scientists and ML engineers

Domino Data Lab 73 Oct 17, 2022
GAM timeseries modeling with auto-changepoint detection. Inspired by Facebook Prophet and implemented in PyMC3

pm-prophet Pymc3-based universal time series prediction and decomposition library (inspired by Facebook Prophet). However, while Faceook prophet is a

Luca Giacomel 314 Dec 25, 2022
Datetimes for Humans™

Maya: Datetimes for Humans™ Datetimes are very frustrating to work with in Python, especially when dealing with different locales on different systems

Timo Furrer 3.4k Dec 28, 2022
High performance, easy-to-use, and scalable machine learning (ML) package, including linear model (LR), factorization machines (FM), and field-aware factorization machines (FFM) for Python and CLI interface.

What is xLearn? xLearn is a high performance, easy-to-use, and scalable machine learning package that contains linear model (LR), factorization machin

Chao Ma 3k Jan 08, 2023
STUMPY is a powerful and scalable Python library for computing a Matrix Profile, which can be used for a variety of time series data mining tasks

STUMPY STUMPY is a powerful and scalable library that efficiently computes something called the matrix profile, which can be used for a variety of tim

TD Ameritrade 2.5k Jan 06, 2023
scikit-learn models hyperparameters tuning and feature selection, using evolutionary algorithms.

Sklearn-genetic-opt scikit-learn models hyperparameters tuning and feature selection, using evolutionary algorithms. This is meant to be an alternativ

Rodrigo Arenas 180 Dec 20, 2022
Timeseries analysis for neuroscience data

=================================================== Nitime: timeseries analysis for neuroscience data ===============================================

NIPY developers 212 Dec 09, 2022