Talend - List files in HDFS


Configuration : Context

To build the job described in this article, first create a context group in the repository with values for the variables used below: IP_HDFS, Port_HDFS and Folder_HDFS.

List files in HDFS (with all versions of Data Fabric)

  • Create a new job
  • Add the component "tHDFSConnection": creates an HDFS connection.
  • Add the component "tHDFSList": lists the files contained in the HDFS folder.
  • Add the component "tHDFSProperties": retrieves the properties of each file (for example: mode, time, directory name...).
  • Add the component "tLogRow": displays the result.
  • Create links:
    • "tHDFSConnection" is connected to "tHDFSList" (through "OnSubjobOk")
    • "tHDFSList" is connected to "tHDFSProperties" (through "Iterate")
    • "tHDFSProperties" is connected to "tLogRow" (through "Main")

  • Double click on "tHDFSConnection" and set its properties:
    • Select the "Cloudera" distribution and the latest available Cloudera version
    • Enter the NameNode URL.
      The URL must follow this format: "hdfs://ip_hdfs:port_hdfs/"
      Use context variables if possible: "hdfs://"+context.IP_HDFS+":"+context.Port_HDFS+"/"
    • Enter the user name
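
As a minimal sketch of how the URL expression above is assembled, the following plain Java class reproduces the same string concatenation outside of Talend. The class name, the method name and the sample values are hypothetical; in a real job, context.IP_HDFS and context.Port_HDFS supply the two values.

```java
public class NameNodeUrl {

    // Hypothetical helper mirroring the Talend expression
    // "hdfs://"+context.IP_HDFS+":"+context.Port_HDFS+"/"
    static String build(String ipHdfs, String portHdfs) {
        return "hdfs://" + ipHdfs + ":" + portHdfs + "/";
    }

    public static void main(String[] args) {
        // Sample values, assumed for illustration only
        System.out.println(build("192.168.1.10", "8020"));
        // prints hdfs://192.168.1.10:8020/
    }
}
```

Keeping the IP address and port in context variables means the same job can target another cluster by switching contexts, without editing the component.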

  • Double click on "tHDFSList" and set its properties:
    • Check "Use an existing connection" and select the connection made by the component "tHDFSConnection"
    • Enter the HDFS folder: context.Folder_HDFS
    • Enter a filemask.
      In the example, the filemask is "*" because this job lists every file.
      To list only the files ending with the extension ".csv", enter "*.csv".
      The asterisk matches any sequence of characters before ".csv".
    • In "Sort", select "Name of file"
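
The filemask behaviour described above can be tried out with Java's standard glob matcher. This is only an analogy under the assumption that the filemask is interpreted as a glob pattern; it is not the code that tHDFSList runs internally, and the class name and file names are hypothetical.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class FilemaskDemo {

    // Returns true when fileName matches the glob-style filemask.
    static boolean matches(String filemask, String fileName) {
        PathMatcher matcher =
                FileSystems.getDefault().getPathMatcher("glob:" + filemask);
        return matcher.matches(Paths.get(fileName));
    }

    public static void main(String[] args) {
        // Hypothetical file names, for illustration only
        String[] names = {"sales.csv", "notes.txt"};
        for (String name : names) {
            System.out.println(name
                    + "  *: " + matches("*", name)
                    + "  *.csv: " + matches("*.csv", name));
        }
    }
}
```

With these sample names, "*" matches both files, while "*.csv" matches only "sales.csv".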

  • Double click on "tHDFSProperties" and set its properties:
    • Check "Use an existing connection"
    • Enter the file: ((String)globalMap.get("tHDFSList_1_CURRENT_FILEPATH"))
      This expression returns the path of the file currently being processed by the tHDFSList component (named tHDFSList_1 in the job).
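
To illustrate the lookup above, here is a small sketch that treats globalMap as an ordinary Map<String, Object>, which is an assumption about how the generated job code exposes it. The class name, the helper method and the sample path are hypothetical; in a real job, tHDFSList fills the entry on each iteration.

```java
import java.util.HashMap;
import java.util.Map;

public class GlobalMapDemo {

    // Same lookup and cast as the expression entered in tHDFSProperties.
    static String currentFilePath(Map<String, Object> globalMap) {
        return (String) globalMap.get("tHDFSList_1_CURRENT_FILEPATH");
    }

    public static void main(String[] args) {
        Map<String, Object> globalMap = new HashMap<>();

        // Simulates what tHDFSList would store for the current iteration
        // (hypothetical path, for illustration only).
        globalMap.put("tHDFSList_1_CURRENT_FILEPATH", "/user/demo/input/sales.csv");

        System.out.println(currentFilePath(globalMap));
        // prints /user/demo/input/sales.csv
    }
}
```

The cast to String is needed because the map stores its values as Object. If the list component has a different name in the job, the key prefix changes accordingly.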

  • Run the job