To create the different jobs described in this article, you first have to create a repository context group holding the values used below (for example IP_HDFS, Port_HDFS and Folder_HDFS).
List files in HDFS (with all versions of Data Fabric)
Create a new job
Add the component "tHDFSConnection" : Allows the creation of a HDFS connection.
Add the component "tHDFSList": List the different files contents in the hdfs folder.
Add the component "tHDFSProperties": Display the properties of the different files (Example : mode, time, directory name...)
Add the component "tLogRow': Display the result.
Create links:
"tHDFSConnection" is connected with "tHDFSList" (through "OnSubjobOk")
"tHDFSList" is connected with "tHDFSProperties" (through "Iterate")
"tHDFSProperties" is connected with "tLogRun" (through "Main")
Double click on "tHDFSConnection" and set its properties:
Add a "Cloudera" distribution and select the latest version of Cloudera
Enter the Name Node URL. The URL has to respect this format : "hdfs://ip_hdfs:port_hdfs/" Use context variables if possible : "hdfs://"+context.IP_HDFS+":"+context.Port_HDFS+"/"
Add the user
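For reference, here is a minimal Java sketch of the connection this component opens, assuming the standard Hadoop client libraries are on the classpath; the class, method and parameter names (HdfsConnectionSketch, connect, ip, port, user) are illustrative only and not part of the Talend job:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HdfsConnectionSketch {
    // Opens an HDFS connection the same way tHDFSConnection does:
    // a Name Node URI in the form "hdfs://ip:port/" plus a user name.
    public static FileSystem connect(String ip, String port, String user) throws Exception {
        String nameNodeUri = "hdfs://" + ip + ":" + port + "/";
        Configuration conf = new Configuration();
        // Connect as the given HDFS user (equivalent to the component's user field)
        return FileSystem.get(new URI(nameNodeUri), conf, user);
    }
}
```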
Double click on "tHDFSList" and set its properties:
Check "Use an existing connection" and select the connection made by the component "tHDFSConnection"
Add a hdfs folder: context.Folder_HDFS
Add a Filemask. In the example, the filemask is "*" because this job is looking up every file. If you want to only search for files ending with the extension ".csv", you can enter "*.csv". The star means "whatever" before ".csv".
In "Sort", select "Name of file"
Double click on "tHDFSProperties" :
Tick "Use an existing connection"
Add a file: ((String)globalMap.get("tHDFSList_1_CURRENT_FILEPATH")) This command use the current file of the component tHDFS_List.
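As an illustration of what tHDFSProperties retrieves for each iterated file, here is a hedged Java sketch based on the Hadoop FileStatus API; the fs and currentFilePath parameters stand in for the existing connection and the path returned by globalMap:

```java
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPropertiesSketch {
    // Prints the kind of properties tHDFSProperties exposes for one file:
    // mode (permissions), modification time and parent directory.
    public static void printProperties(FileSystem fs, String currentFilePath) throws Exception {
        // currentFilePath plays the role of
        // ((String) globalMap.get("tHDFSList_1_CURRENT_FILEPATH")) inside the job
        FileStatus status = fs.getFileStatus(new Path(currentFilePath));
        System.out.println("mode: " + status.getPermission());
        System.out.println("time: " + status.getModificationTime());
        System.out.println("directory: " + status.getPath().getParent());
    }
}
```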