Synchronization with your data lake

Data governance performs several synchronizations with your data lake.

Hive Metastore synchronization

This synchronization retrieves all tables and their metadata. It always runs at application startup and then, by default, every 30 minutes (0 */30 * * * *).

You can customize the execution time and frequency by setting the environment variable DATAGOV_TABLES_LAKE_SYNC_CRON to a cron expression, then restarting the application.

Table row count synchronization

This synchronization retrieves the number of rows in each table. By default, it runs twice an hour, at 15 and 45 minutes past the hour (0 15,45 * * * *).

You can customize the execution time and frequency by setting the environment variable DATAGOV_TABLES_ROWCOUNT_SYNC_CRON to a cron expression, then restarting the application.

This synchronization can be resource intensive.

Table last modification and size synchronization

This synchronization retrieves the last modification date and the size of each table from its files. By default, it runs twice an hour, at 10 and 40 minutes past the hour (0 10,40 * * * *).

You can customize the execution time and frequency by setting the environment variable DATAGOV_TABLES_DATESIZE_SYNC_CRON to a cron expression, then restarting the application.

Files marked as dataset synchronization

This synchronization synchronizes the files on HDFS that are marked as datasets via data governance. By default, it runs twice an hour, at 20 and 50 minutes past the hour (0 20,50 * * * *).

You can customize the execution time and frequency by setting the environment variable DATAGOV_FILES_LAKE_SYNC_CRON to a cron expression, then restarting the application.

Files marked as dataset row count synchronization 

This synchronization retrieves the number of rows in each file on HDFS that is marked as a dataset via data governance. By default, it runs twice an hour, on the hour and at 30 minutes past the hour (0 0,30 * * * *).

You can customize the execution time and frequency by setting the environment variable DATAGOV_FILES_ROWCOUNT_SYNC_CRON to a cron expression, then restarting the application.

This synchronization can be resource intensive.

Files marked as dataset last modification and size synchronization

This synchronization retrieves the last modification date and the size of each file on HDFS that is marked as a dataset via data governance. By default, it runs twice an hour, at 25 and 55 minutes past the hour (0 25,55 * * * *).

You can customize the execution time and frequency by setting the environment variable DATAGOV_FILES_DATESIZE_SYNC_CRON to a cron expression, then restarting the application.
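
The six variables and their default schedules can be summarized in one place. The Python sketch below is only an illustration of the "environment variable overrides the default" behavior described above; the helper name effective_crons is hypothetical and is not part of the application, whose actual lookup logic is not documented here.

```python
import os
from typing import Mapping

# Default schedules as documented above (six-field cron, seconds first).
DEFAULT_CRONS = {
    "DATAGOV_TABLES_LAKE_SYNC_CRON": "0 */30 * * * *",
    "DATAGOV_TABLES_ROWCOUNT_SYNC_CRON": "0 15,45 * * * *",
    "DATAGOV_TABLES_DATESIZE_SYNC_CRON": "0 10,40 * * * *",
    "DATAGOV_FILES_LAKE_SYNC_CRON": "0 20,50 * * * *",
    "DATAGOV_FILES_ROWCOUNT_SYNC_CRON": "0 0,30 * * * *",
    "DATAGOV_FILES_DATESIZE_SYNC_CRON": "0 25,55 * * * *",
}


def effective_crons(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Hypothetical helper: return the cron expression each synchronization
    would use, taking the value from the environment when the variable is
    set and falling back to the documented default otherwise."""
    return {name: env.get(name, default) for name, default in DEFAULT_CRONS.items()}
```

For example, effective_crons({"DATAGOV_TABLES_ROWCOUNT_SYNC_CRON": "0 0 2 * * *"}) would report the table row count synchronization as running once a day at 02:00, with the other five schedules left at their defaults.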

CRON syntax

Six fields; an asterisk matches every value: * * * * * * → seconds, minutes, hours, day of the month, month, weekday

Month and weekday names can be given as the first three letters of the English names. 

Examples:

  • 0 0 * * * * : top of every hour of every day
  • 0 0/30 8-10 * * * : 8:00, 8:30, 9:00, 9:30, 10:00 and 10:30 every day
  • 0 15,45 * * * SAT,SUN : at 15 and 45 minutes past every hour, on Saturdays and Sundays only
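
To check what an expression means before setting a variable, the field rules above can be sketched in a few lines of Python. This is a minimal illustration, not the scheduler the application uses: the names matches, _expand and _value are hypothetical, only *, ?, lists, ranges, steps and three-letter names are handled, and the day-of-month and weekday fields must both match (an assumption made for this sketch).

```python
from datetime import datetime

MONTHS = {name: i for i, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}
# Weekdays are numbered 0-6 with Sunday = 0; 7 is also accepted for Sunday.
DAYS = {name: i for i, name in enumerate("SUN MON TUE WED THU FRI SAT".split())}

# (lowest, highest, names) for: second, minute, hour, day of month, month, weekday
FIELDS = [(0, 59, {}), (0, 59, {}), (0, 23, {}),
          (1, 31, {}), (1, 12, MONTHS), (0, 7, DAYS)]


def _value(token, names):
    """Convert a number or a three-letter English name to an integer."""
    token = token.upper()
    return names[token] if token in names else int(token)


def _expand(field, low, high, names):
    """Expand one cron field into the set of values it allows."""
    allowed = set()
    for part in field.split(","):
        rng, _, step = part.partition("/")
        step = int(step) if step else 1
        if rng in ("*", "?"):
            start, end = low, high
        elif "-" in rng:
            first, last = rng.split("-")
            start, end = _value(first, names), _value(last, names)
        else:
            start = _value(rng, names)
            # "n/step" means "from n up to the maximum, every step"
            end = high if "/" in part else start
        allowed.update(range(start, end + 1, step))
    return allowed


def matches(expr, when):
    """Return True if the six-field expression fires at the given datetime."""
    parts = expr.split()
    if len(parts) != 6:
        raise ValueError("expected six fields: sec min hour day month weekday")
    sets = [_expand(p, lo, hi, names) for p, (lo, hi, names) in zip(parts, FIELDS)]
    if 7 in sets[5]:
        sets[5].add(0)  # treat 7 as Sunday
    weekday = when.isoweekday() % 7  # Monday=1 ... Saturday=6, Sunday=0
    values = [when.second, when.minute, when.hour, when.day, when.month, weekday]
    return all(v in allowed for v, allowed in zip(values, sets))
```

With this sketch, matches("0 0/30 8-10 * * *", datetime(2024, 1, 6, 10, 30, 0)) is true, while the same expression evaluated at 11:00:00 on that day is false, which agrees with the second example in the list.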