anonyspark

anonyspark is a lightweight Python package for schema-driven data masking and anonymization of PySpark DataFrames. It is designed for ML engineers, data analysts, and compliance teams who work with sensitive data in big data environments, and it supports data privacy, PII redaction, and compliance with regulations such as HIPAA and GDPR.


Motivation

In enterprise data pipelines, personally identifiable information (PII) and other sensitive fields are often left exposed in logs, training data, or staging zones. anonyspark addresses this by masking such fields deterministically and according to a schema, directly in Spark, so the data never has to leave the distributed environment.
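To make "deterministic" and "schema-aware" concrete, here is a minimal pure-Python sketch of the idea. It does not use anonyspark's API: the schema format, the rule names (`hash`, `redact`, `partial`), the salt, and the helpers `mask_value` and `mask_record` are all hypothetical and chosen only for illustration. In Spark, the same per-field rules would typically be applied column-wise, for example with the built-in `pyspark.sql.functions.sha2` for the hashing rule.

```python
import hashlib

# Hypothetical masking schema: field name -> rule. These rule names are
# illustrative and are NOT anonyspark's actual configuration keys.
MASK_SCHEMA = {
    "email": "hash",
    "ssn": "redact",
    "name": "partial",
}


def mask_value(value, rule, salt="demo-salt"):
    """Mask one value according to a rule; None passes through unchanged."""
    if value is None:
        return None
    text = str(value)
    if rule == "hash":
        # Deterministic: the same input and salt always give the same digest,
        # so masked columns can still be joined or grouped on.
        return hashlib.sha256((salt + text).encode("utf-8")).hexdigest()
    if rule == "redact":
        # Replace every character, keeping only the original length.
        return "*" * len(text)
    if rule == "partial":
        # Keep the first character and mask the rest.
        return text[:1] + "*" * (len(text) - 1)
    raise ValueError(f"unknown masking rule: {rule!r}")


def mask_record(record, schema=MASK_SCHEMA):
    """Return a copy of record with every field named in schema masked."""
    return {
        key: mask_value(val, schema[key]) if key in schema else val
        for key, val in record.items()
    }


row = {"name": "Alice", "email": "alice@example.com", "ssn": "123-45-6789", "age": 34}
masked = mask_record(row)
print(masked)
```

Fields that the schema does not mention (here `age`) pass through untouched, and masking the same row twice yields the same hashed `email`, which is what keeps joins and deduplication working on masked data.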


Key Features


Use Cases


Installation

pip install anonyspark

PyPI link: https://pypi.org/project/anonyspark-core

License: MIT License