anonyspark is a lightweight Python package for schema-driven data masking and anonymization in PySpark DataFrames. Designed for ML engineers, data analysts, and compliance teams working with sensitive data in big data environments, it helps enforce data privacy, PII redaction, and regulatory compliance (e.g., HIPAA, GDPR).
In enterprise data pipelines, personally identifiable information (PII) and sensitive fields are often left exposed in logs, training data, or staging zones. anonyspark solves this by enabling deterministic and schema-aware masking of such fields directly in Spark, without leaving the distributed environment.
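To make "deterministic and schema-aware masking" concrete, here is a minimal sketch of the general idea. It does not use anonyspark's API: the `MASKING_SCHEMA` format, the strategy names (`hash`, `redact`, `keep`), the `salt` parameter, and the functions `mask_value` and `mask_dataframe` are all hypothetical names chosen for illustration. The only library calls are standard `hashlib` and built-in `pyspark.sql.functions`.

```python
import hashlib

# Hypothetical schema: column name -> masking strategy.
# This is an illustrative format, not anonyspark's own.
MASKING_SCHEMA = {"email": "hash", "ssn": "redact", "age": "keep"}


def mask_value(value, strategy, salt="demo-salt"):
    """Mask a single value. The same input and salt always yield the same output."""
    if value is None or strategy == "keep":
        return value
    if strategy == "hash":
        # Salted SHA-256: deterministic, so joins on the masked column still work.
        return hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
    if strategy == "redact":
        return "***"
    raise ValueError(f"unknown strategy: {strategy}")


def mask_dataframe(df, schema, salt="demo-salt"):
    """Apply the schema to a PySpark DataFrame with built-in column functions,
    so the masking runs on the executors rather than on collected data."""
    from pyspark.sql import functions as F  # lazy import; requires pyspark

    for name, strategy in schema.items():
        if strategy == "hash":
            salted = F.concat(F.lit(salt), F.col(name).cast("string"))
            df = df.withColumn(name, F.sha2(salted, 256))
        elif strategy == "redact":
            # Nulls stay null; every non-null value becomes a fixed token.
            df = df.withColumn(name, F.when(F.col(name).isNotNull(), F.lit("***")))
        elif strategy != "keep":
            raise ValueError(f"unknown strategy: {strategy}")
    return df
```

Given an existing DataFrame `df` with these columns, `mask_dataframe(df, MASKING_SCHEMA)` would return a masked copy. Because the hash is salted and deterministic, the same email maps to the same token across runs, which preserves joins and group-bys while hiding the raw value. The salt should be treated as a secret in any real pipeline.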
pip install anonyspark
PyPI link: https://pypi.org/project/anonyspark-core
License: MIT License