Python - Read & Write files from Impala with Security
Gist page: example-python-read-and-write-from-impala-with-security
Common part
Thrift-sasl
The script below does not work with thrift-sasl 0.3.0; it works only with thrift-sasl 0.2.1.
Add thrift-sasl==0.2.1 to your requirements.txt file.
Library dependencies
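To catch a wrong thrift-sasl version before the script fails later on, a small check can run at start-up. This is a minimal sketch: the helper names are hypothetical, and it assumes Python 3.8+ for importlib.metadata.

```python
from importlib.metadata import PackageNotFoundError, version

# Version reported as working in this document; 0.3.0 is reported as broken.
SUPPORTED_THRIFT_SASL = "0.2.1"


def thrift_sasl_is_supported(installed):
    """Return True only when the given version string matches the known-good pin."""
    return installed == SUPPORTED_THRIFT_SASL


def installed_thrift_sasl():
    """Return the installed thrift-sasl version, or None if it is not installed."""
    try:
        return version("thrift-sasl")
    except PackageNotFoundError:
        return None
```

Calling thrift_sasl_is_supported(installed_thrift_sasl()) at the top of the script lets you stop early with a clear message when the pinned version is missing.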
import ibis
import pandas as pd
import os
WEBHDFS URI
WebHDFS URIs look like this: http://namenodedns:port/user/hdfs/folder/file.csv
Default port is 50070
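The URI pattern above can be assembled from its parts. The helper below is a hypothetical illustration, not part of ibis or any WebHDFS client; it simply follows the pattern and default port given in this section.

```python
def webhdfs_uri(namenode, path, port=50070):
    """Build a URI of the form http://namenodedns:port/user/hdfs/folder/file.csv."""
    # Strip a leading slash so the result never contains a double slash.
    return f"http://{namenode}:{port}/{path.lstrip('/')}"


uri = webhdfs_uri("namenodedns", "/user/hdfs/folder/file.csv")
```

Pass a different port argument if your NameNode does not listen on the default one.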
Impala Connection
Default port is 21050.
Connection
# Connect to Impala by providing the Impala host IP and port (21050 by default),
# credentials, and a WebHDFS client.
hdfs = ibis.hdfs_connect(host=os.environ['IP_HDFS'], port=50070)
client = ibis.impala.connect(host=os.environ['IP_IMPALA'],
                             port=21050,
                             hdfs_client=hdfs,
                             user=os.environ['LDAP_USER'],
                             password=os.environ['LDAP_PASSWORD'],
                             auth_mechanism='PLAIN')
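The connection code reads four environment variables (IP_HDFS, IP_IMPALA, LDAP_USER, LDAP_PASSWORD). If one is unset, os.environ[...] raises a KeyError that only names the first missing key. A minimal sketch of a pre-flight check follows; missing_settings is a hypothetical helper name.

```python
import os

# Environment variables used by the connection code in this document.
REQUIRED_VARS = ("IP_HDFS", "IP_IMPALA", "LDAP_USER", "LDAP_PASSWORD")


def missing_settings(environ=None):
    """Return the required variable names that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]
```

Calling missing_settings() before connecting and stopping when the list is non-empty reports every missing variable at once.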
Impala over SSL
If your Impala instance is secured with SSL, add the following parameters to your ibis.impala.connect() call:
- use_ssl=True → Mandatory. The client will communicate over SSL to the server.
- ca_cert=None → Optional. If not set, the certificate is not validated, which is a potential security issue. The certificate chain file is available on Saagie's servers at /data/ssl/certs/ca-chain.cert.pem
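Put together, the two SSL parameters can be passed alongside the usual connection arguments. This is a sketch under assumptions: impala_ssl_options and connect_with_ssl are hypothetical helper names, the environment variables are the ones used elsewhere in this document, and the certificate path is the Saagie location given above.

```python
import os

# Certificate chain location given in this document for Saagie's servers.
CA_CHAIN = "/data/ssl/certs/ca-chain.cert.pem"


def impala_ssl_options(ca_cert=CA_CHAIN):
    """Return the extra keyword arguments for an SSL-secured Impala."""
    # ca_cert=None disables certificate validation (see the note above).
    return {"use_ssl": True, "ca_cert": ca_cert}


def connect_with_ssl(hdfs_client=None):
    """Open an Impala connection over SSL with LDAP credentials."""
    import ibis  # imported here so impala_ssl_options() works without ibis installed

    return ibis.impala.connect(
        host=os.environ["IP_IMPALA"],
        port=21050,
        hdfs_client=hdfs_client,
        user=os.environ["LDAP_USER"],
        password=os.environ["LDAP_PASSWORD"],
        auth_mechanism="PLAIN",
        **impala_ssl_options(),
    )
```

Keeping the SSL options in one place makes it easy to switch the certificate path per environment without touching the rest of the connection code.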
How to write an Impala table with Python?
Code example
# Create a simple pandas DataFrame with two columns
liste_hello = ['hello1', 'hello2']
liste_world = ['world1', 'world2']
df = pd.DataFrame(data={'hello': liste_hello, 'world': liste_world})

# Write the DataFrame to Impala if the table does not exist yet
db = client.database('default')
if not client.exists_table('helloworld'):
    db.create_table('helloworld', df)
    t = db['helloworld']
    t.execute()
How to query an Impala table with Python?
Code example
# ====== Reading table ======
# Select data with a SQL query.
# limit=None returns the whole table; otherwise only the first 10000 rows are returned.
requete = client.sql('select * from helloworld')
df = requete.execute(limit=None)
How to write an Impala table from other Impala tables in Python?
Code example
# Write the join between tables a and b into table c
client.raw_sql('CREATE TABLE c STORED AS PARQUET AS SELECT a.col1, b.col2 FROM a INNER JOIN b ON (a.id=b.id)')
# No data is transferred to Python