Complete documentation on data import, processing, and model creation is available here: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/index.html
Two options are available: you can download H2O from the internet every time you launch a job, or you can install it from HDFS to speed up the process. Both options are described below.
Run the script found here: "Upload H2O library to HDFS".
Download H2O from the following URL: http://h2o-release.s3.amazonaws.com/h2o/rel-wright/1/h2o-3.20.0.2.zip
Unzip it, go to the R/ folder, and upload the file "h2o_3.20.0.2.tar.gz" to HDFS, in the folder of your choice (/user/h2o/install_R/ is recommended).
Use the following code in your script to install H2O:
    # Install the package directly from HDFS. Replace nn1 by the correct value if needed
    install.packages('http://nn1:50070/webhdfs/v1/user/h2o/install_R/h2o_3.20.0.2.tar.gz?op=OPEN', repos = NULL, type = 'source')
    library(h2o)
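The URL passed to install.packages follows the WebHDFS pattern http://<namenode>:<port>/webhdfs/v1<path>?op=OPEN. As a minimal sketch, the helper below builds that URL from its parts, so that only the name node and the HDFS path need to change between environments. The function name webhdfs_open_url is hypothetical (it is not part of H2O or Hadoop), and the default port 50070 is simply the one used on this page.

```r
# Hypothetical helper (not part of H2O): builds the WebHDFS "OPEN" URL
# for a file stored in HDFS.
webhdfs_open_url <- function(namenode, hdfs_path, port = 50070) {
  # Make sure the HDFS path is absolute
  if (!startsWith(hdfs_path, "/")) {
    hdfs_path <- paste0("/", hdfs_path)
  }
  sprintf("http://%s:%d/webhdfs/v1%s?op=OPEN", namenode, as.integer(port), hdfs_path)
}

# Same URL as the one used in the install command
h2o_url <- webhdfs_open_url("nn1", "/user/h2o/install_R/h2o_3.20.0.2.tar.gz")
print(h2o_url)

# The result can then be passed to install.packages:
# install.packages(h2o_url, repos = NULL, type = "source")
```

The install call is left commented out so the snippet can be run without access to the cluster.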
If the previous code does not work, you can try the alternatives below:
    # This line works in the R capsule and notebooks. Replace nn1 by the correct value if needed
    download.file('http://nn1:50070/webhdfs/v1/user/hdfs/h2o_3.20.0.2.tar.gz?op=OPEN', destfile = 'h2o_3.20.0.2.tar.gz')
    # This line is simpler but only works in the capsule
    # system('hdfs dfs -get /user/hdfs/h2o_3.20.0.2.tar.gz', intern = TRUE)
    install.packages('h2o_3.20.0.2.tar.gz', repos = NULL, type = 'source')
    library(h2o)
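Since several install methods may or may not work depending on where the script runs, one possible approach is to try them in order and stop at the first one that succeeds. The sketch below assumes each method is wrapped in a function; first_success is a hypothetical helper, and the two attempts in the demo are stand-ins that only simulate a failure and a success. Warnings are treated as failures here because install.packages and download.file often report problems as warnings rather than errors, which also means a harmless warning would cause a fallback.

```r
# Hypothetical helper: run each function in `attempts` (a named list) until
# one completes without error or warning. Returns the name of that attempt.
first_success <- function(attempts) {
  for (name in names(attempts)) {
    on_failure <- function(cond) {
      message("Attempt '", name, "' failed: ", conditionMessage(cond))
      FALSE
    }
    ok <- tryCatch({
      attempts[[name]]()
      TRUE
    }, error = on_failure, warning = on_failure)
    if (ok) {
      return(name)
    }
  }
  stop("All attempts failed")
}

# Stand-in attempts for illustration only. In a real script, each function
# would contain one of the install commands from this page.
demo_attempts <- list(
  install_from_webhdfs = function() stop("name node unreachable"),
  download_then_install = function() invisible(NULL)
)

used <- first_success(demo_attempts)
print(used)
```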
    # Install the dependencies that are missing
    pkgs <- c("RCurl", "jsonlite")
    for (pkg in pkgs) {
      if (!(pkg %in% rownames(installed.packages()))) {
        install.packages(pkg)
      }
    }
    # Install H2O from the internet
    install.packages("h2o", type = "source", repos = "http://h2o-release.s3.amazonaws.com/h2o/rel-wright/1/R")
    library(h2o)
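Installing H2O at every job launch takes time, so it may be worth checking first whether the expected version is already present. The following is a minimal sketch using only base R; needs_install is a hypothetical helper name, and the version string 3.20.0.2 is the one used on this page.

```r
# Hypothetical helper: TRUE if `pkg` is missing or installed in a version
# other than `version`.
needs_install <- function(pkg, version) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return(TRUE)
  }
  utils::packageVersion(pkg) != package_version(version)
}

if (needs_install("h2o", "3.20.0.2")) {
  message("h2o 3.20.0.2 is not installed: run one of the install commands from this page")
} else {
  message("h2o 3.20.0.2 is already installed")
}
```

Note that requireNamespace loads the package namespace when it is found, which is harmless here since the next step is library(h2o) anyway.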
    # Replace the ip by the correct value
    h2o.connect(ip = 'h2o_custom_url.internal.pX', port = 80)
    # Change the url as needed
    iris_h2o <- h2o.importFile('hdfs://nn1:8020/user/h2o/data/iris/iris.csv')
    # Convert a local R data frame (here iris_local) into an H2O frame
    iris_h2o <- as.h2o(iris_local)
    # Create a split for train and test dataset
    iris.split <- h2o.splitFrame(iris_h2o)
    train <- iris.split[[1]]
    test <- iris.split[[2]]
    # Create a Random forest model with our dataset as input
    rf <- h2o.randomForest(y = 'Species', training_frame = train, validation_frame = test)
    # Print the result in console
    rf
    # Results are also available in the H2O web interface, with more details than this simple print
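To give an idea of what the split step does, here is a base R analogy on the built-in iris data frame. It is not H2O's implementation. It assumes that h2o.splitFrame, when called without a ratios argument, puts about 75% of the rows in the first frame, and that rows are assigned at random, so the two parts are only approximately 75% and 25% of the data. The names train_local and test_local are made up for this example.

```r
# Base R analogy of a random train/test split (not H2O code).
set.seed(42)  # fixed seed so the split is reproducible

# Each row goes to the training set with probability 0.75
in_train <- runif(nrow(iris)) < 0.75

train_local <- iris[in_train, ]
test_local <- iris[!in_train, ]

cat(nrow(train_local), "training rows,", nrow(test_local), "test rows\n")
```

Because each row is drawn independently, the sizes vary slightly from one seed to another. Every row ends up in exactly one of the two sets.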
    # Change the url as needed
    h2o.saveModel(rf, 'hdfs://nn1:8020/user/h2o/models/', force = TRUE)