Talend - List files in HDFS
GitHub project: example-talend-list-file-in-hdfs
Preamble
Configuration: Context
To create the different jobs shown in this article, you first have to create a context group in the repository and fill in its values.
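Based on the expressions used in the steps below, the context group needs at least the following variables (the sample values are assumptions, not taken from the original article):

- IP_HDFS: the IP address of the HDFS Name Node (for example "192.168.1.10")
- Port_HDFS: the Name Node port (for example "8020")
- Folder_HDFS: the HDFS folder whose files will be listed (for example "/user/cloudera/input")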
List files in HDFS (with all versions of Data Fabric)
- Create a new job
- Add the component "tHDFSConnection": allows the creation of an HDFS connection.
- Add the component "tHDFSList": lists the files contained in the HDFS folder.
- Add the component "tHDFSProperties": displays the properties of each file (for example: mode, time, directory name...).
- Add the component "tLogRow": displays the result.
- Create links:
- "tHDFSConnection" is connected with "tHDFSList" (through "OnSubjobOk")
- "tHDFSList" is connected with "tHDFSProperties" (through "Iterate")Â
- "tHDFSProperties" is connected with "tLogRun" (through "Main")
- Double click on "tHDFSConnection" and set its properties:
- Add a "Cloudera" distribution and select the latest version of Cloudera
- Enter the Name Node URL.
The URL has to follow this format: "hdfs://ip_hdfs:port_hdfs/"
Use context variables whenever possible: "hdfs://" + context.IP_HDFS + ":" + context.Port_HDFS + "/" (see the sketch after this step)
- Add the user
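Under the hood, this step is roughly equivalent to opening a connection with the plain Hadoop client API. Here is a minimal sketch, assuming the hadoop-client library is on the classpath; the concrete IP, port, and user values are placeholders standing in for the context variables above, not values from the original article:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HdfsConnect {
    public static void main(String[] args) throws Exception {
        // Placeholder values standing in for the Talend context variables.
        String ipHdfs = "192.168.1.10"; // context.IP_HDFS (assumed value)
        String portHdfs = "8020";       // context.Port_HDFS (assumed value)
        String user = "cloudera";       // the user entered in tHDFSConnection (assumed value)

        // Same format as the Name Node URL expected by the component:
        // "hdfs://" + context.IP_HDFS + ":" + context.Port_HDFS + "/"
        URI nameNodeUri = URI.create("hdfs://" + ipHdfs + ":" + portHdfs + "/");

        // Open the HDFS connection as the given user.
        FileSystem fs = FileSystem.get(nameNodeUri, new Configuration(), user);
        System.out.println("Connected to " + fs.getUri());
        fs.close();
    }
}
```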
- Double click on "tHDFSList" and set its properties:
- Check "Use an existing connection" and select the connection made by the component "tHDFSConnection"
- Add an HDFS folder: context.Folder_HDFS
- Add a Filemask.
In the example, the filemask is "*" because this job lists every file.
If you only want files ending with the extension ".csv", enter "*.csv": the asterisk matches anything before ".csv" (see the sketch after this step).
- In "Sort", select "Name of file"
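To illustrate the filemask semantics, here is a minimal sketch with the Hadoop client API: globStatus applies the same kind of glob pattern, and the sort mirrors the "Name of file" option. The folder path is a placeholder for context.Folder_HDFS, and this is an assumed equivalent, not Talend's actual generated code:

```java
import java.util.Arrays;
import java.util.Comparator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsListFiles {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS points at the cluster
        // (or reuse the URI and user from the previous sketch).
        FileSystem fs = FileSystem.get(new Configuration());

        String folder = "/user/cloudera/input"; // context.Folder_HDFS (assumed value)
        String filemask = "*";                  // "*.csv" would keep only CSV files

        // globStatus expands the filemask into the list of matching files.
        FileStatus[] matches = fs.globStatus(new Path(folder, filemask));
        if (matches == null) {
            matches = new FileStatus[0]; // non-glob pattern with no match returns null
        }

        // Mirror the "Sort: Name of file" option.
        Arrays.sort(matches, Comparator.comparing((FileStatus s) -> s.getPath().getName()));

        for (FileStatus status : matches) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}
```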
- Double click on "tHDFSProperties" :
- Tick "Use an existing connection"
- Add a file:Â ((String)globalMap.get("tHDFSList_1_CURRENT_FILEPATH"))
This command use the current file of the component tHDFS_List.
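On each iteration, "tHDFSList" publishes the current path in globalMap under the key "tHDFSList_1_CURRENT_FILEPATH". Here is a minimal sketch of the equivalent property lookup with the Hadoop client API; the file path is an assumed placeholder standing in for that globalMap value:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileProperties {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS points at the cluster.
        FileSystem fs = FileSystem.get(new Configuration());

        // Stands in for ((String) globalMap.get("tHDFSList_1_CURRENT_FILEPATH")).
        String currentFilePath = "/user/cloudera/input/example.csv"; // assumed value

        FileStatus status = fs.getFileStatus(new Path(currentFilePath));

        // The kind of properties tHDFSProperties passes on to tLogRow.
        System.out.println("path:      " + status.getPath());
        System.out.println("mode:      " + status.getPermission());
        System.out.println("time:      " + status.getModificationTime());
        System.out.println("owner:     " + status.getOwner());
        System.out.println("directory: " + status.isDirectory());
        System.out.println("size:      " + status.getLen());
        fs.close();
    }
}
```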
- Run the job