Document datasets
- Click the Datasets tab.
- Select the dataset you want to classify.
- Classify the dataset in this window.
...
Add a comment to the dataset
Document personal data
To access the personal data panel:
...
Tag dataset containing personal data
- Check the "contains personal data" box if the dataset contains personal data.
Document the consent and/or configure the anonymization process
- Click Edit Settings to document the consent and/or configure the anonymization process.
...
A default anonymization process written in Scala is provided by Saagie. The source code is available here: https://github.com/saagie/outis.
You can use it as is, modify it, or replace it with one of your own processes.
To use it, build the jar and create a Spark processing job on the platform with the following command line:
...
```
spark-submit \
  --conf "spark.executor.extraJavaOptions='-Dlog4j.configuration=log4j.xml'" \
  --conf spark.ui.showConsoleProgress=false \
  --driver-java-options "-Dlog4j.configuration=log4j.xml" \
  {file} -u hdfs_user -t metastore_url -d datagov_user -p $ENV_VAR_PASSWORD \
  datasetsToAnonymized_url callback_url
```
where:
- hdfs_user = user used to launch the job; this user must have the right to write in HDFS
- metastore_url = URL of the Hive metastore (e.g. thrift://nn1:9083)
- datagov_user = user with access to Data Governance on the platform and the "Access all datasets" right (may be the same as hdfs_user)
- datasetsToAnonymized_url = URL used to obtain the datasets to anonymize (e.g. http://{IP_DATAGOVERNANCE}:{PORT}/api/v1/datagovernance/platform/{PLATFORM_ID}/privacy/datasets)
- callback_url = URL used to notify that a dataset has been anonymized (e.g. http://{IP_DATAGOVERNANCE}:{PORT}/api/v1/datagovernance/platform/{PLATFORM_ID}/privacy/events/datasetAnonymized)
- $ENV_VAR_PASSWORD = environment variable holding the password
You can download the latest version of the jar here:
Exceptions handling
No dataset anonymization if:
- You don't provide an entry date
- You don't provide a list of fields to anonymize
- You provide a string field as an entry date without a pattern to parse it
- The dataset isn't stored as CSV or Parquet files in the table
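The third case above can be illustrated with a minimal Scala sketch: a string entry date such as "31/12/2019" is ambiguous (day/month vs. month/day) unless an explicit pattern is supplied. The object, method, and pattern below are hypothetical illustrations, not code from the outis source.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Hypothetical helper: parsing a string entry date requires a pattern,
// which is why the anonymization process refuses to run without one.
object EntryDateExample {
  def parseEntryDate(raw: String, pattern: String): LocalDate =
    LocalDate.parse(raw, DateTimeFormatter.ofPattern(pattern))

  def main(args: Array[String]): Unit = {
    // "31/12/2019" can only be interpreted once "dd/MM/yyyy" is given.
    println(parseEntryDate("31/12/2019", "dd/MM/yyyy")) // prints 2019-12-31
  }
}
```

Without the pattern argument, `LocalDate.parse` would fall back to ISO-8601 and reject such a value, which mirrors the "no anonymization" behavior described above.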
No record anonymization if:
...
This covers the following types: Byte, Short, Int, Long, Float, Double, and BigDecimal.
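As a minimal sketch (not the outis implementation), restricting anonymization to exactly those numeric types could be expressed with a Scala pattern match; the object name and the neutral replacement values are assumptions for illustration only.

```scala
// Hypothetical sketch: only the numeric types listed above are anonymized;
// any other type is returned unchanged, i.e. the record is not anonymized.
object NumericAnonymization {
  def anonymize(value: Any): Any = value match {
    case _: Byte | _: Short | _: Int | _: Long => 0L            // integral types
    case _: Float | _: Double                  => 0.0           // floating-point types
    case _: BigDecimal                         => BigDecimal(0) // arbitrary precision
    case other                                 => other         // unsupported: left as is
  }

  def main(args: Array[String]): Unit = {
    println(anonymize(42))       // prints 0
    println(anonymize(3.14))     // prints 0.0
    println(anonymize("secret")) // prints secret (strings are not covered)
  }
}
```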
Errors
You can find the job's execution errors in the Error Output section of the Job Logs.