Github project : https://github.com/saagie/example-R-querying-impala/tree/master/ODBC
Dependencies
R package
DBI: Standard database interface
odbc: Connect to ODBC databases using DBI
dplyr: Data manipulation library
dbplyr: Converts data manipulation written in R to SQL
implyr: Same as above but specific for Impala
Impala ODBC driver
Download the Impala ODBC drive from Cloudera https://www.cloudera.com/downloads/connectors/impala/odbc/2-5-39.html
On Windows
Install the .msi executable
On Linux
Install odbcinst tool
Code Block | ||
---|---|---|
| ||
sudo apt-get install odbcinst |
Install the .deb package (or other extension depending on your distribution)
Code Block | ||
---|---|---|
| ||
sudo dpkg -i clouderaimpalaodbc_*.deb |
Execute the following command line to register the driver as installed
Code Block | ||
---|---|---|
| ||
sudo odbcinst -i -d -f /opt/cloudera/impalaodbc/Setup/odbcinst.ini |
Write-up of automation of package installation on Saagie:
Code Block | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
if (!length(find.package("odbc", quiet = TRUE))) {
# Install OS package for ODBC
system("sudo apt-get install -y unixodbc unixodbc-dev", intern = TRUE)
# Install R package for ODBC
install.packages("odbc", repos = "https://cloud.r-project.org")
}
# If Impala ODBC driver is missing:
if (!file.exists("/etc/odbcinst.ini") ||
!length(grep("Cloudera ODBC Driver for Impala 64-bit", readLines("/etc/odbcinst.ini"), ignore.case = TRUE))) {
# Retrieve Impala drivers for Debianoids from Cloudera
if (!dir.exists("driver-impala")) dir.create("driver-impala")
if (!length(Sys.glob("driver-impala/clouderahiveodbc_*.deb")))
download.file(paste0("https://downloads.cloudera.com/connectors/impala_odbc_2.5.40.1025/",
"Debian/clouderaimpalaodbc_2.5.40.1025-2_amd64.deb"),
destfile = "driver-impala/clouderaimpalaodbc_2.5.40.1025-2_amd64.deb")
# Install driver
system("sudo dpkg -i driver-impala/clouderaimpalaodbc_*.deb", intern = TRUE)
# Setup driver and /etc/odbcinst.ini file
system("sudo apt-get install -y odbcinst", intern = TRUE)
system("sudo odbcinst -i -d -f /opt/cloudera/impalaodbc/Setup/odbcinst.ini", intern = TRUE)
} |
Parameters
Driver: The name of the driver to use (Default name on Windows is usually "Cloudera ODBC Driver for Impala", Default name on Linux is usually "Cloudera ODBC Driver for Impala 64-bit")
Host: Ip or hostname of the Impala database
Port: Default is 21050
Schema: Schema in which to execute the queries
Additionnals parameters for authentification
AuthMech: The authentification mechanism, use 3 for authentification with user / password
UseSASL: Simple Authentification and Security Layer, use 1 for authentification with user / password
UID: Username
PWD: Password
Code explanation for querying impala
Loading ODBC driver and connection
Code Block |
---|
con <- DBI::dbConnect(odbc::odbc(), Driver = "Cloudera ODBC Driver for Impala 64-bit", Host = "host", Port = 21050, Schema = "default", # Remove the options below if you don't need authentification AuthMech = 3, UseSASL = 1, UID = "user", PWD = "password") |
Query examples
Show Tables
Code Block |
---|
# On the version 2.5.39 of the Linux Impala ODBC driver, the method dbListTables returns only the first letter of each schema and table # One workaround is to use plain SQL "show schemas" and "show tables" # List all tables from all schemas dbListTables(con) dbGetQuery(con, 'show schemas') dbGetQuery(con, 'show tables') |
Get all elements of a table
Code Block |
---|
sample <- DBI::dbReadTable(con, "sample") # Fetch all data from a table |
Execute queries in dplyr syntax
Code Block |
---|
# Create a lazy tbl from an Impala table forecast_db <- tbl(con, in_schema('forecast', 'forecast_weekly')) forecast_db # Print columns and column types # The query is written in dplyr syntax, but executed on a remote sql database query <- forecast_db %>% summarise(mean_forecast = forecast %>% as.numeric() %>% mean) show_query(query) # Show the query that will be executed query # Executes the lazy query when the result is printed # Example of usable dplyr verbs forecast_db %>% filter(prediction_date == max(prediction_date)) %>% group_by(reference) %>% summarise(forecast_mean = mean(forecast), forecast_max = max(forecast), nb_forecast_total = n()) sales_db <- tbl(con, in_schema('forecast', 'sales_weekly')) sales_db forecast_db <- forecast_db %>% mutate(reference = as.character(reference)) diff_db <- inner_join(forecast_db, sales_db, by=c('reference', 'year', 'week')) diff_by_ref <- diff_db %>% group_by(reference, year, week) %>% summarise(diff_by_week = abs(forecast - quantity)) %>% # Difference between forecast and reality for each prediction group_by(reference) %>% summarise(diff_by_ref = sum(diff_by_week)) # Sum of all differences for each reference diff_db # Executes all the lazy queries above |