R - Query Impala with ODBC
Github project : https://github.com/saagie/example-R-querying-impala/tree/master/ODBC
Dependencies
R package
DBI: Standard database interface
odbc: Connect to ODBC databases using DBI
dplyr: Data manipulation library
dbplyr: Converts data manipulation written in R to SQL
implyr: Same as above but specific for Impala
Impala ODBC driver
Download the Impala ODBC drive from Cloudera https://www.cloudera.com/downloads/connectors/impala/odbc/2-5-39.html
On Windows
Install the .msi executable
On Linux
Install odbcinst tool
sudo apt-get install odbcinst
Install the .deb package (or other extension depending on your distribution)
sudo dpkg -i clouderaimpalaodbc_*.deb
Execute the following command line to register the driver as installed
sudo odbcinst -i -d -f /opt/cloudera/impalaodbc/Setup/odbcinst.ini
Write-up of automation of package installation on Saagie
if (!length(find.package("odbc", quiet = TRUE))) { # Install OS package for ODBC system("sudo apt-get install -y unixodbc unixodbc-dev", intern = TRUE) # Install R package for ODBC install.packages("odbc", repos = "https://cloud.r-project.org") } # If Impala ODBC driver is missing: if (!file.exists("/etc/odbcinst.ini") || !length(grep("Cloudera ODBC Driver for Impala 64-bit", readLines("/etc/odbcinst.ini"), ignore.case = TRUE))) { # Retrieve Impala drivers for Debianoids from Cloudera if (!dir.exists("driver-impala")) dir.create("driver-impala") if (!length(Sys.glob("driver-impala/clouderahiveodbc_*.deb"))) download.file(paste0("https://downloads.cloudera.com/connectors/impala_odbc_2.5.40.1025/", "Debian/clouderaimpalaodbc_2.5.40.1025-2_amd64.deb"), destfile = "driver-impala/clouderaimpalaodbc_2.5.40.1025-2_amd64.deb") # Install driver system("sudo dpkg -i driver-impala/clouderaimpalaodbc_*.deb", intern = TRUE) # Setup driver and /etc/odbcinst.ini file system("sudo apt-get install -y odbcinst", intern = TRUE) system("sudo odbcinst -i -d -f /opt/cloudera/impalaodbc/Setup/odbcinst.ini", intern = TRUE) }
Parameters
Driver: The name of the driver to use (Default name on Windows is usually "Cloudera ODBC Driver for Impala", Default name on Linux is usually "Cloudera ODBC Driver for Impala 64-bit")
Host: Ip or hostname of the Impala database
Port: Default is 21050
Schema: Schema in which to execute the queries
Additionnals parameters for authentification
AuthMech: The authentification mechanism, use 3 for authentification with user / password
UseSASL: Simple Authentification and Security Layer, use 1 for authentification with user / password
UID: Username
PWD: Password
Code explanation for querying impala
Loading ODBC driver and connection
library(DBI) con <- dbConnect(odbc::odbc(), Driver = "Cloudera ODBC Driver for Impala 64-bit", Host = "host", # usually "dn1" (resolved automatically from the platform) Port = 21050, Schema = "default", # Remove the options below if you don't need authentification AuthMech = 3, UseSASL = 1, UID = "user", PWD = "password")
Impala over SSL
If your Impala is secured with SSL, you have to add the following parameters to your command:
- SSL=1 → Mandatory. The client will communicate over SSL to the server.
- AllowSelfSignedServerCert=1 → Optional. To allow authentication using self-signed certificates that have not been added to the list of
trusted certificates The certificate chain file is available on Saagie's servers at /data/ssl/certs/ca-chain.cert.pem - AllowHostNameCNMismatch=1 → Optional. To allow the common name of a CA-issued SSL certificate to not match the host name of the
Impala server.
Query examples
Show Tables
# On the version 2.5.39 of the Linux Impala ODBC driver, the method dbListTables returns only the first letter of each schema and table # One workaround is to use plain SQL "show schemas" and "show tables" # List all tables from all schemas dbListTables(con) dbGetQuery(con, 'show schemas') dbGetQuery(con, 'show tables')
Get all elements of a table
sample <- DBI::dbReadTable(con, "sample") # Fetch all data from a table
Execute queries in dplyr syntax
# Create a lazy tbl from an Impala table forecast_db <- tbl(con, in_schema('forecast', 'forecast_weekly')) forecast_db # Print columns and column types # The query is written in dplyr syntax, but executed on a remote sql database query <- forecast_db %>% summarise(mean_forecast = forecast %>% as.numeric() %>% mean) show_query(query) # Show the query that will be executed query # Executes the lazy query when the result is printed # Example of usable dplyr verbs forecast_db %>% filter(prediction_date == max(prediction_date)) %>% group_by(reference) %>% summarise(forecast_mean = mean(forecast), forecast_max = max(forecast), nb_forecast_total = n()) sales_db <- tbl(con, in_schema('forecast', 'sales_weekly')) sales_db forecast_db <- forecast_db %>% mutate(reference = as.character(reference)) diff_db <- inner_join(forecast_db, sales_db, by=c('reference', 'year', 'week')) diff_by_ref <- diff_db %>% group_by(reference, year, week) %>% summarise(diff_by_week = abs(forecast - quantity)) %>% # Difference between forecast and reality for each prediction group_by(reference) %>% summarise(diff_by_ref = sum(diff_by_week)) # Sum of all differences for each reference diff_db # Executes all the lazy queries above