GitHub Page : exemple-pyspark-read-and-write
```python
from pyspark import SparkContext
from pyspark.sql import SparkSession

# Point Spark at the remote Hive metastore
SparkContext.setSystemProperty("hive.metastore.uris", "thrift://nn1:9083")

sparkSession = (SparkSession
    .builder
    .appName('example-pyspark-read-and-write-from-hive')
    .enableHiveSupport()
    .getOrCreate())

# Build a small DataFrame to write out
data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 5)]
df = sparkSession.createDataFrame(data)
```
Write the DataFrame to a Hive table
```python
# Write into Hive
df.write.saveAsTable('example')
```
Read the table back with an SQL query
```python
# Read from Hive
df_load = sparkSession.sql('SELECT * FROM example')
df_load.show()
```
To run a PySpark job on Data Fabric, you must package your Python source file into a zip file. Note that the file inside the archive must be named __main__.py.
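The packaging step can be scripted with Python's standard zipfile module. A minimal sketch, assuming your job source is in a file named my_job.py (a hypothetical name; substitute your own script):

```python
import zipfile

# Hypothetical source file; in practice this is your real PySpark script.
with open("my_job.py", "w") as f:
    f.write('print("PySpark job entry point")\n')

# Package for Data Fabric: the archive member must be named __main__.py,
# so we rename it via arcname while zipping.
with zipfile.ZipFile("job.zip", "w") as zf:
    zf.write("my_job.py", arcname="__main__.py")

print(zipfile.ZipFile("job.zip").namelist())  # → ['__main__.py']
```

The arcname parameter lets you keep a descriptive file name in your project while still satisfying the __main__.py requirement inside the archive.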
Prior to creating the Spark session, you must add the following snippet:

```python
import os

os.environ["HADOOP_USER_NAME"] = "hdfs"
os.environ["PYTHON_VERSION"] = "3.5.2"
```
At the time of writing, only two Python versions are available: 3.5.2 and 2.7.13.