Talend - Write files to HDFS


Configuration: Context

To create the different jobs shown in this article, you first have to create a context group in the repository, with values for the variables used below (such as IP_HDFS and Port_HDFS).

Write a file to HDFS (works with all versions of Data Fabric)

  • Create a new job
  • Add the component "tHDFSConnection": creates an HDFS connection.
  • Add the component "tFileInputDelimited": Reads a file located on your computer.
  • Add the component "tHDFSOutput": Writes data to HDFS.
  • Create links:

    • "tHDFSConnection" is connected to "tFileInputDelimited" (through "OnSubjobOk")

    • "tFileInputDelimited" is connected to "tHDFSOutput" (through "Main")

  • Double-click on "tHDFSConnection" and set its properties:
    • Select the "Cloudera" distribution and the latest Cloudera version
    • Enter the Name Node URL.
      The URL must follow this format: "hdfs://ip_hdfs:port_hdfs/"
      Use context variables where possible: "hdfs://"+context.IP_HDFS+":"+context.Port_HDFS+"/"
    • Enter the user name
    • Uncheck "Use Datanode Hostname"
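
    The Name Node URL built from context variables can be sketched in Python (the values below are placeholders standing in for the repository context group, not real cluster addresses):

    ```python
    # Placeholder values standing in for the repository context group
    context = {"IP_HDFS": "192.168.1.10", "Port_HDFS": "8020"}

    # Mirrors "hdfs://" + context.IP_HDFS + ":" + context.Port_HDFS + "/"
    namenode_url = "hdfs://" + context["IP_HDFS"] + ":" + context["Port_HDFS"] + "/"
    assert context["Port_HDFS"].isdigit(), "port must be numeric"
    print(namenode_url)
    ```

    Keeping the address and port in context variables lets the same job run against different clusters without editing the component.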

  • Double-click on the component "tFileInputDelimited":
    • Enter the name of the local file (with its path)
    • If needed, tick the CSV option and adjust its settings (field separator, header rows, etc.)

  • Double-click on the component "tHDFSOutput":
    • Click on "Edit schema"
      • Add a column "flow" to both Input and Output
    • Check "Use an existing connection" and select the "tHDFSConnection" created above
    • Enter the name of the target file (in HDFS) and adjust the other options if needed.

  • Run the job
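
The whole job (read a local delimited file, write it to HDFS over an existing connection) can be sketched as a Python function. This is an illustrative analogue only: Talend generates Java, and the client shape assumed here matches the Python `hdfs` package's `InsecureClient` (a WebHDFS client), not the job's native code.

```python
import csv
import io

def copy_delimited_to_hdfs(client, local_text, hdfs_path, delimiter=";"):
    """Sketch of the job above: read delimited rows (tFileInputDelimited)
    and write them to HDFS (tHDFSOutput).

    `client` only needs a write(path, data=..., overwrite=...) method; the
    Python `hdfs` package's InsecureClient matches this shape (an assumption
    for illustration -- Talend itself generates Java, not Python).
    """
    reader = csv.reader(io.StringIO(local_text), delimiter=delimiter)
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    for row in reader:          # the "Main" link: rows flow one by one
        writer.writerow(row)
    client.write(hdfs_path, data=out.getvalue(), overwrite=True)

# A real run would open the connection first (the tHDFSConnection step),
# e.g. over WebHDFS:
#   from hdfs import InsecureClient   # third-party package, an assumption
#   client = InsecureClient("http://ip_hdfs:9870", user="your_user")
```

The separation between the connection object and the copy function mirrors the job layout: "tHDFSConnection" opens the connection once, and the output component reuses it via "Use an existing connection".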