Pyspark - Read & Write files from HDFS
GitHub page: exemple-pyspark-read-and-write
Common parts
Library dependencies
Import dependencies
from pyspark.sql import SparkSession
Creating Spark Session
Create Spark Session
sparkSession = SparkSession.builder.appName("example-pyspark-read-and-write").getOrCreate()
data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 5)]
df = sparkSession.createDataFrame(data)
How to write a file to HDFS?
Code example
Write a file into HDFS
# Write into HDFS
df.write.csv("hdfs://cluster/user/hdfs/test/example.csv")
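Two details of the write above are easy to miss: Spark treats the target path as a directory of part files rather than a single CSV file, and the default save mode raises an error if the path already exists. The sketch below wraps the same call with an explicit save mode and an optional header row. The helper name `write_csv_to_hdfs` and its parameters are hypothetical, chosen for illustration only.

```python
def write_csv_to_hdfs(df, path, mode="overwrite", header=False):
    """Write a Spark DataFrame as CSV under `path` (hypothetical helper).

    `path` ends up as a directory containing one part file per partition.
    """
    (
        df.write
        .mode(mode)                # "overwrite" replaces any existing output
        .option("header", header)  # True writes the column names as the first line
        .csv(path)
    )
```

With the `df` created earlier, a call might look like `write_csv_to_hdfs(df, "hdfs://cluster/user/hdfs/test/example.csv")`. Passing `mode="append"` would add new part files instead of replacing the existing ones.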
How to read a file from HDFS?
Code example
This code shows only the first 20 records of the file, because show() displays 20 rows by default.
Read From HDFS
# Read from HDFS
df_load = sparkSession.read.csv('hdfs://cluster/user/hdfs/test/example.csv')
df_load.show()
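Because the file was written without a header row, the loaded columns get the default names `_c0` and `_c1`, and every value is read as a string unless schema inference is enabled. The following is a minimal sketch of a read that infers column types, optionally renames the columns, and displays a chosen number of rows. The helper name `read_csv_from_hdfs` and the column names in the usage note are assumptions, not part of the original example.

```python
def read_csv_from_hdfs(spark, path, columns=None, rows=20):
    """Load a CSV from `path`, preview it, and return the DataFrame (hypothetical helper)."""
    # inferSchema=True makes Spark scan the data to guess column types
    df = spark.read.csv(path, inferSchema=True)
    if columns:
        # Replace the default _c0, _c1, ... names with the given ones
        df = df.toDF(*columns)
    df.show(rows)  # show() prints 20 rows unless told otherwise
    return df
```

For instance, `read_csv_from_hdfs(sparkSession, 'hdfs://cluster/user/hdfs/test/example.csv', columns=["label", "rank"], rows=5)` would print the five records with named columns.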
How to use it on Data Fabric?
To run a PySpark job on Data Fabric, you must package your Python source file into a zip file. Note that the file inside the archive must be named __main__.py.
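The packaging step can be scripted with Python's standard `zipfile` module. The sketch below stores a source file in the archive under the name `__main__.py`, whatever the file is called on disk. It assumes the file is expected at the root of the archive; the function name `package_job` and the file names in the usage note are hypothetical.

```python
import zipfile
from pathlib import Path


def package_job(source_file, archive_path):
    """Zip one Python source file, storing it as __main__.py (hypothetical helper)."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        # arcname sets the name inside the archive (assumed to be at the root)
        archive.write(source_file, arcname="__main__.py")
    return Path(archive_path)
```

Calling `package_job("read_write_job.py", "read_write_job.zip")` would produce an archive whose only entry is `__main__.py`.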
How to use it on Data Fabric's Jupyter Notebooks?
Before creating the Spark session, you must add the following snippet:
Notebook configuration
import os
os.environ["HADOOP_USER_NAME"] = "hdfs"
os.environ["PYTHON_VERSION"] = "3.5.2"
At the time of writing, only two Python versions are available: 3.5.2 and 2.7.13.
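Since only those two versions are accepted, a small guard can catch a mistyped version before the Spark session is created. This is a sketch only: the function name `configure_notebook` and the `AVAILABLE_PYTHON_VERSIONS` constant are hypothetical, and the version list simply mirrors the two values stated above.

```python
import os

# Versions listed above as available at the time of writing
AVAILABLE_PYTHON_VERSIONS = ("3.5.2", "2.7.13")


def configure_notebook(hadoop_user="hdfs", python_version="3.5.2"):
    """Set the notebook environment variables (hypothetical helper)."""
    if python_version not in AVAILABLE_PYTHON_VERSIONS:
        raise ValueError(
            "Unsupported Python version %r; expected one of %s"
            % (python_version, ", ".join(AVAILABLE_PYTHON_VERSIONS))
        )
    os.environ["HADOOP_USER_NAME"] = hadoop_user
    os.environ["PYTHON_VERSION"] = python_version
```

Calling `configure_notebook()` in the first notebook cell has the same effect as the snippet above, while `configure_notebook(python_version="3.9.0")` raises a `ValueError` instead of silently setting an unavailable version.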