Synchronization with your data lake
Data governance performs several synchronizations with your data lake.
Hive Metastore synchronization
This synchronization retrieves all tables and their metadata. It always runs at application startup and then, by default, every 30 minutes (0 */30 * * * *).
You can customize the execution time and frequency by setting the environment variable DATAGOV_TABLES_LAKE_SYNC_CRON to a cron expression (set the environment variable, then restart the application).
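Before restarting the application, it can help to check that the value you plan to set has the expected shape. The following is a minimal sketch in Python, not part of data governance itself: the helper name resolve_cron is hypothetical, and the fallback to the default only mirrors the behavior documented here. It checks the field count only, not the content of each field.

```python
import os


def resolve_cron(variable, default, environ=os.environ):
    """Return the cron expression that would apply for `variable`.

    Hypothetical helper: falls back to the documented default when the
    variable is not set, and rejects values that do not have six fields.
    """
    value = environ.get(variable, default).strip()
    fields = value.split()
    if len(fields) != 6:
        raise ValueError(
            f"{variable}: expected 6 space-separated fields, "
            f"got {len(fields)}: {value!r}"
        )
    return value


if __name__ == "__main__":
    # Uses the real environment; prints the default if the variable is unset.
    print(resolve_cron("DATAGOV_TABLES_LAKE_SYNC_CRON", "0 */30 * * * *"))
```

The same check applies to the other DATAGOV_*_SYNC_CRON variables described in this section, each with its own default.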
Table row count synchronization
This synchronization retrieves the number of rows in each table. By default, it runs twice per hour, at 15 and 45 minutes past the hour (0 15,45 * * * *).
You can customize the execution time and frequency by setting the environment variable DATAGOV_TABLES_ROWCOUNT_SYNC_CRON to a cron expression (set the environment variable, then restart the application).
This synchronization can be resource-intensive.
Table last modification and size synchronization
This synchronization retrieves the last modification date and the size of each table from its files. By default, it runs twice per hour, at 10 and 40 minutes past the hour (0 10,40 * * * *).
You can customize the execution time and frequency by setting the environment variable DATAGOV_TABLES_DATESIZE_SYNC_CRON to a cron expression (set the environment variable, then restart the application).
Files marked as dataset synchronization
This synchronization synchronizes the files on HDFS that are marked as datasets via data governance. By default, it runs twice per hour, at 20 and 50 minutes past the hour (0 20,50 * * * *).
You can customize the execution time and frequency by setting the environment variable DATAGOV_FILES_LAKE_SYNC_CRON to a cron expression (set the environment variable, then restart the application).
Files marked as dataset row count synchronization
This synchronization retrieves the number of rows in each file on HDFS that is marked as a dataset via data governance. By default, it runs twice per hour, on the hour and at 30 minutes past (0 0,30 * * * *).
You can customize the execution time and frequency by setting the environment variable DATAGOV_FILES_ROWCOUNT_SYNC_CRON to a cron expression (set the environment variable, then restart the application).
This synchronization can be resource-intensive.
Files marked as dataset last modification and size synchronization
This synchronization retrieves the last modification date and the size of each file on HDFS that is marked as a dataset via data governance. By default, it runs twice per hour, at 25 and 55 minutes past the hour (0 25,55 * * * *).
You can customize the execution time and frequency by setting the environment variable DATAGOV_FILES_DATESIZE_SYNC_CRON to a cron expression (set the environment variable, then restart the application).
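When choosing custom schedules, it can be useful to see which synchronizations start on the same minute, since some of them are resource-intensive. The sketch below, in Python, lists the default expressions documented above and reports the minutes shared by more than one job. The helper names are hypothetical, and the minute parsing is deliberately limited to the two forms used by the defaults ("*/n" and comma-separated values).

```python
# Default schedules as documented in this section
# (six fields: second minute hour day-of-month month weekday).
DEFAULT_SCHEDULES = {
    "DATAGOV_TABLES_LAKE_SYNC_CRON": "0 */30 * * * *",
    "DATAGOV_TABLES_ROWCOUNT_SYNC_CRON": "0 15,45 * * * *",
    "DATAGOV_TABLES_DATESIZE_SYNC_CRON": "0 10,40 * * * *",
    "DATAGOV_FILES_LAKE_SYNC_CRON": "0 20,50 * * * *",
    "DATAGOV_FILES_ROWCOUNT_SYNC_CRON": "0 0,30 * * * *",
    "DATAGOV_FILES_DATESIZE_SYNC_CRON": "0 25,55 * * * *",
}


def start_minutes(expression):
    """Minutes of the hour at which a schedule fires.

    Only handles minute fields of the form '*/n' or 'a,b,...'.
    """
    minute_field = expression.split()[1]
    if minute_field.startswith("*/"):
        return list(range(0, 60, int(minute_field[2:])))
    return sorted(int(m) for m in minute_field.split(","))


def shared_minutes(schedules):
    """Map each minute to the variables firing on it, keeping shared minutes only."""
    by_minute = {}
    for name, expression in schedules.items():
        for minute in start_minutes(expression):
            by_minute.setdefault(minute, []).append(name)
    return {m: names for m, names in sorted(by_minute.items()) if len(names) > 1}


if __name__ == "__main__":
    for minute, names in shared_minutes(DEFAULT_SCHEDULES).items():
        print(f"minute {minute:02d}: {', '.join(names)}")
```

With the defaults, the only shared minutes are 0 and 30, where the Hive Metastore synchronization and the file row count synchronization both start. To test a custom plan, replace the values in the dictionary with your own expressions.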
CRON syntax
Six values; a star matches every value: * * * * * * → seconds, minutes, hours, day of the month, month, weekday
Month and weekday names can be given as the first three letters of the English names.
Examples:
- 0 0 * * * * : top of every hour of every day
- 0 0/30 8-10 * * * : 8:00, 8:30, 9:00, 9:30, 10:00 and 10:30 every day
- 0 15,45 * * * SAT,SUN : twice per hour, at 15 and 45 minutes, on Saturdays and Sundays only
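To make the field semantics concrete, here is a minimal sketch in Python of how such an expression can be evaluated against a point in time. It is an illustration only, not the scheduler used by data governance. It assumes that all six fields must match, that weekdays are numbered 0 to 7 with both 0 and 7 meaning Sunday, and it supports only stars, lists, ranges, steps, and three-letter names. Ranges that wrap around (for example FRI-SUN) are rejected.

```python
from datetime import datetime

# (name, lowest value, highest value) for the six fields, in order.
FIELDS = [
    ("second", 0, 59),
    ("minute", 0, 59),
    ("hour", 0, 23),
    ("day of month", 1, 31),
    ("month", 1, 12),
    ("weekday", 0, 7),
]
MONTHS = {name: i for i, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}
WEEKDAYS = {name: i for i, name in enumerate(
    "SUN MON TUE WED THU FRI SAT".split())}
NAME_MAPS = [{}, {}, {}, {}, MONTHS, WEEKDAYS]


def _to_int(token, names):
    token = token.upper()
    return names[token] if token in names else int(token)


def parse_field(text, low, high, names=None):
    """Expand one cron field into the set of values it allows."""
    names = names or {}
    allowed = set()
    for part in text.split(","):
        rng, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if step < 1:
            raise ValueError(f"invalid step in {part!r}")
        if rng == "*":
            start, end = low, high
        elif "-" in rng:
            first, last = rng.split("-", 1)
            start, end = _to_int(first, names), _to_int(last, names)
        else:
            start = _to_int(rng, names)
            # "a/n" means "from a up to the maximum, every n".
            end = high if step_text else start
        if not low <= start <= end <= high:
            raise ValueError(f"{part!r} is outside {low}-{high}")
        allowed.update(range(start, end + 1, step))
    return allowed


def parse_cron(expression):
    """Split a six-field expression into six sets of allowed values."""
    parts = expression.split()
    if len(parts) != 6:
        raise ValueError(f"expected 6 fields, got {len(parts)}: {expression!r}")
    sets = [parse_field(part, low, high, names)
            for part, (_, low, high), names in zip(parts, FIELDS, NAME_MAPS)]
    sets[5] = {value % 7 for value in sets[5]}  # treat 7 as Sunday (0)
    return sets


def matches(expression, moment):
    """True if `moment` (a datetime) satisfies every field of `expression`."""
    second, minute, hour, day, month, weekday = parse_cron(expression)
    return (moment.second in second and moment.minute in minute
            and moment.hour in hour and moment.day in day
            and moment.month in month
            and moment.isoweekday() % 7 in weekday)  # Sunday becomes 0


if __name__ == "__main__":
    saturday = datetime(2024, 1, 6, 9, 15, 0)
    monday = datetime(2024, 1, 8, 9, 15, 0)
    print(matches("0 15,45 * * * SAT,SUN", saturday))
    print(matches("0 15,45 * * * SAT,SUN", monday))
```

Run as a script, this checks the weekend example against a Saturday and a Monday at 9:15:00. The same matches function can be used to try out an expression before setting it in one of the environment variables.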