Talend - Read files from HDFS

Preamble

Configuration: Context

To create the different jobs shown in this article, you first have to create a context group in the Repository containing the variables used below (IP_HDFS, Port_HDFS) and their values.
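
For reference, here is a minimal sketch of such a context group. The variable names match the expressions used later in this article; the values are placeholders to adapt to your cluster (8020 is the default NameNode RPC port on Cloudera distributions):

    IP_HDFS   = "192.168.1.100"   // placeholder: NameNode IP address
    Port_HDFS = "8020"            // default NameNode RPC port on CDH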

Read a file from HDFS and display it in the console (with all versions of Data Fabric)

  • Create a new job
  • Add the component "tHDFSConnection" : Allows the creation of a HDFS connection.
  • Add the component "tHDFSInput": Read a file in the HDFS.
  • Add the component "tLogRow': Display the result.
  • Create links:
    • "tHDFSConnection" is connected with "tHDFSInput" (through "OnSubjobOk")
    • "tHDFSInput" is connected with "tLogRun" (through "Main")

  • Double click on "tHDFSConnection" and set its properties:
    • Add a "Cloudera" distribution and select the latest version of Cloudera
    • Enter the Name Node URL. 
      The URL has to respect this format : "hdfs://ip_hdfs:port_hdfs/"
      Use context variables if possible : "hdfs://"+context.IP_HDFS+":"+context.Port_HDFS+"/" 
    • Add the user
    • Uncheck "Use Datanode Hostname"

  • Double click on the component "tHDFSInput" :
    • Click on "Edit a schema"
      • Enter a variable "flow"
    • Tick "Use an existing connection"
    • Enter a file name

  • Run the job
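
To make the behaviour of this job concrete, here is a minimal standalone sketch of the same read-and-print logic written against the Hadoop Java client, the same kind of API the code generated by Talend relies on. The URI, user name and file path are assumptions taken from the placeholders above; replace them with your own values:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadFromHdfs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Same effect as unchecking "Use Datanode Hostname" in tHDFSConnection
            conf.set("dfs.client.use.datanode.hostname", "false");
            // Placeholder URI and user, like "hdfs://" + context.IP_HDFS + ":" + context.Port_HDFS + "/"
            FileSystem fs = FileSystem.get(
                    URI.create("hdfs://192.168.1.100:8020/"), conf, "cloudera");
            // tHDFSInput + tLogRow: read the file line by line and print to the console
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(new Path("/user/cloudera/input.txt"))))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line); // one row of the "flow" column per line
                }
            }
            fs.close();
        }
    }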

Copy a file from HDFS to the local machine (with all versions of Data Fabric)

  • Create a new job
  • Add the component "tHDFSConnection" : Allows the creation of a HDFS connection.
  • Add the component "tHDFSGet": Copy the HDFS file in the local directory.
  • Create links:
    • "tHDFSConnection" is connected with "tHDFSGet" (through "OnSubjobOk")

  • Double click on "tHDFSConnection" and set its properties:
    • Add a "Cloudera" distribution and select the latest version of Cloudera
    • Enter the Name Node URL. 
      The URL has to respect this format : "hdfs://ip_hdfs:port_hdfs/"
      Use context variables if possible : "hdfs://"+context.IP_HDFS+":"+context.Port_HDFS+"/" 
    • Add the user
    • Uncheck "Use Datanode Hostname"

  • Double-click on the component "tHDFSGet":
    • Tick "Use an existing connection"
    • Enter the HDFS directory to read from
    • Enter the local directory to copy the file to
    • Add a filemask and, if needed, a new name for the copied file

  • Run the job
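
As with the previous job, a minimal sketch of the equivalent logic with the Hadoop Java client may help: tHDFSGet essentially performs an HDFS-to-local copy. The URI, user name and paths are placeholders to adapt:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class GetFromHdfs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Same effect as unchecking "Use Datanode Hostname" in tHDFSConnection
            conf.set("dfs.client.use.datanode.hostname", "false");
            FileSystem fs = FileSystem.get(
                    URI.create("hdfs://192.168.1.100:8020/"), conf, "cloudera");
            // Equivalent of tHDFSGet: copy a file from HDFS to the local file system;
            // the target file name plays the role of the "new file name" setting
            fs.copyToLocalFile(new Path("/user/cloudera/input.txt"),
                               new Path("/tmp/input_copy.txt"));
            fs.close();
        }
    }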