R - Query Impala with ODBC

Github project : https://github.com/saagie/example-R-querying-impala/tree/master/ODBC

Dependencies

R package

DBI: Standard database interface
odbc: Connect to ODBC databases using DBI
dplyr: Data manipulation library
dbplyr: Converts data manipulation written in R to SQL
implyr: Same as above but specific for Impala

Impala ODBC driver

Download the Impala ODBC drive from Cloudera https://www.cloudera.com/downloads/connectors/impala/odbc/2-5-39.html

On Windows

Install the .msi executable

On Linux

Install odbcinst tool

sudo apt-get install odbcinst

Install the .deb package (or other extension depending on your distribution)

sudo dpkg -i clouderaimpalaodbc_*.deb

Execute the following command line to register the driver as installed

 sudo odbcinst -i -d -f /opt/cloudera/impalaodbc/Setup/odbcinst.ini

Write-up of automation of package installation on Saagie

Package installation

if (!length(find.package("odbc", quiet = TRUE))) {
  
  # Install OS package for ODBC
  system("sudo apt-get install -y unixodbc unixodbc-dev", intern = TRUE)
  # Install R package for ODBC
  install.packages("odbc", repos = "https://cloud.r-project.org")
  
}

# If Impala ODBC driver is missing:
if (!file.exists("/etc/odbcinst.ini") || 
    !length(grep("Cloudera ODBC Driver for Impala 64-bit", readLines("/etc/odbcinst.ini"), ignore.case = TRUE))) { 
  
  # Retrieve Impala drivers for Debianoids from Cloudera
  if (!dir.exists("driver-impala")) dir.create("driver-impala")
  if (!length(Sys.glob("driver-impala/clouderahiveodbc_*.deb")))
    download.file(paste0("https://downloads.cloudera.com/connectors/impala_odbc_2.5.40.1025/",
                         "Debian/clouderaimpalaodbc_2.5.40.1025-2_amd64.deb"),
                  destfile = "driver-impala/clouderaimpalaodbc_2.5.40.1025-2_amd64.deb")
  
  # Install driver
  system("sudo dpkg -i driver-impala/clouderaimpalaodbc_*.deb", intern = TRUE)
  
  # Setup driver and /etc/odbcinst.ini file
  system("sudo apt-get install -y odbcinst", intern = TRUE)
  system("sudo odbcinst -i -d -f /opt/cloudera/impalaodbc/Setup/odbcinst.ini", intern = TRUE)
  
}

Parameters

Driver: The name of the driver to use (Default name on Windows is usually "Cloudera ODBC Driver for Impala", Default name on Linux is usually "Cloudera ODBC Driver for Impala 64-bit")

Host: Ip or hostname of the Impala database

Port: Default is 21050

Schema: Schema in which to execute the queries

Additionnals parameters for authentification

AuthMech: The authentification mechanism, use 3 for authentification with user / password

UseSASL: Simple Authentification and Security Layer, use 1 for authentification with user / password

UID: Username

PWD: Password

Code explanation for querying impala

Loading ODBC driver and connection

con <- DBI::dbConnect(odbc::odbc(),
                      Driver   = "Cloudera ODBC Driver for Impala 64-bit",
                      Host     = "host",
                      Port     = 21050,
                      Schema   = "default",
                      # Remove the options below if you don't need authentification
                      AuthMech = 3,
                      UseSASL  = 1,
                      UID      = "user",
                      PWD      = "password")

Query examples

Show Tables

# On the version 2.5.39 of the Linux Impala ODBC driver, the method dbListTables returns only the first letter of each schema and table
# One workaround is to use plain SQL "show schemas" and "show tables"

# List all tables from all schemas
dbListTables(con)
dbGetQuery(con, 'show schemas')
dbGetQuery(con, 'show tables')

Get all elements of a table

sample <- DBI::dbReadTable(con, "sample") # Fetch all data from a table

Execute queries in dplyr syntax

# Create a lazy tbl from an Impala table
forecast_db <- tbl(con, in_schema('forecast', 'forecast_weekly'))
forecast_db # Print columns and column types

# The query is written in dplyr syntax, but executed on a remote sql database
query <- forecast_db %>% 
    summarise(mean_forecast = forecast %>% as.numeric() %>% mean)

show_query(query) # Show the query that will be executed
query # Executes the lazy query when the result is printed

# Example of usable dplyr verbs
forecast_db %>% 
    filter(prediction_date == max(prediction_date)) %>%
    group_by(reference) %>%
    summarise(forecast_mean = mean(forecast), 
              forecast_max = max(forecast), 
              nb_forecast_total = n())

sales_db <- tbl(con, in_schema('forecast', 'sales_weekly'))
sales_db

forecast_db <- forecast_db %>%
    mutate(reference = as.character(reference))

diff_db <- inner_join(forecast_db, sales_db, by=c('reference', 'year', 'week'))

diff_by_ref <- diff_db %>% 
    group_by(reference, year, week) %>% 
    summarise(diff_by_week = abs(forecast - quantity)) %>% # Difference between forecast and reality for each prediction
    group_by(reference) %>%
    summarise(diff_by_ref = sum(diff_by_week)) # Sum of all differences for each reference

diff_db # Executes all the lazy queries above