I have two Docker containers, one running my webapp and one running my machine learning application, both using H2O. Initially, both called h2o.init() pointing to the same IP:PORT, so a single H2O cluster with one node was initialized.
Consider that I have one model already trained and I'm now training a second one. During this training, if the webapp made a call to the H2O cluster (e.g., requesting a prediction from the first model), it would kill the training process (error message below), which was unintended. I tried setting a different port for each app, but the same situation kept occurring. I don't understand why, since I thought that by setting two different ports, two independent clusters would be initialized, and therefore two jobs could run simultaneously. (A rough sketch of both connection attempts follows the error message.)
Error message
Job request failed Server error java.lang.IllegalArgumentException:
Error: Job is missing
Request: GET /3/Jobs/$0301c0a8f00232d4ffffffff$_911222b9c2e4404c31191c0d3ffd44c6, will retry after 3s.
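For reference, both apps were connecting roughly like this (Python client; the IP, ports, and model id below are placeholders rather than my real values):
import h2o
# Webapp container (and, initially, the ML container too):
h2o.init(ip="172.17.0.2", port=54321)
# What I tried next: give each app its own port, expecting two
# independent clusters to be started.
h2o.init(ip="172.17.0.2", port=54323)   # ML container
# The webapp call that interrupts the running training job:
# model = h2o.get_model("first_model_id")   # placeholder model id
# preds = model.predict(scoring_frame)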
Alternatively, I moved H2O into a container of its own and I'm trying to set up a multi-node cluster so that each app runs on its own node. Below are the Dockerfile and entrypoint.sh used to start the cluster:
Dockerfile
########################################################################
# Dockerfile for Oracle JDK 8 on Ubuntu 16.04
########################################################################
# pull base image
FROM ubuntu:16.04
RUN \
echo 'DPkg::Post-Invoke {"/bin/rm -f /var/cache/apt/archives/*.deb || true";};' | tee /etc/apt/apt.conf.d/no-cache && \
echo "deb http://mirror.math.princeton.edu/pub/ubuntu xenial main universe" >> /etc/apt/sources.list && \
apt-get update -q -y && \
apt-get dist-upgrade -y && \
apt-get clean && \
rm -rf /var/cache/apt/* && \
DEBIAN_FRONTEND=noninteractive apt-get install -y wget unzip openjdk-8-jdk python-pip python-sklearn python-pandas python-numpy python-matplotlib software-properties-common python-software-properties && \
apt-get clean
# Fetch h2o
ENV H2O_RELEASE rel-zipf
ENV H2O_VERSION 3.32.1.7
RUN \
wget http://h2o-release.s3.amazonaws.com/h2o/${H2O_RELEASE}/$(echo $H2O_VERSION | cut -d "." -f4)/h2o-${H2O_VERSION}.zip -O /opt/h2o.zip && \
unzip -d /opt /opt/h2o.zip && \
rm /opt/h2o.zip && \
cd /opt && \
cd `find . -name 'h2o.jar' | sed 's/.\///;s/\/h2o.jar//g'` && \
cp h2o.jar /opt && \
/usr/bin/pip install `find . -name "*.whl"`
# Define the working directory
WORKDIR \
/home/h2o
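# Each node uses its given port for the REST API and port+1 for internal node-to-node traffic, so expose the whole range for the 3 nodes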
EXPOSE 54321-54326
# Define entrypoint
COPY ./bin/entrypoint.sh ./entrypoint.sh
RUN chmod +x entrypoint.sh
ENTRYPOINT ["./entrypoint.sh"]
entrypoint.sh
#!/bin/bash
# Entrypoint script.
set -e
d=`dirname $0`
# Use 90% of RAM for H2O, 30% for each node.
memTotalKb=`cat /proc/meminfo | grep MemTotal | sed 's/MemTotal:[ \t]*//' | sed 's/ kB//'`
memTotalMb=$[ $memTotalKb / 1024 ]
tmp=$[ $memTotalMb * 30 ]
xmxMb=$[ $tmp / 100 ]
# Use all 36 cores for H2O, 12 for each node.
totalCores=`lscpu | grep "^CPU(s)" | sed 's/CPU(s):[ \t]*//'`
nthreads=$[ $totalCores / 3 ]
# First try running java.
java -version
# Start 2 H2O nodes in the background
nohup java -Xmx${xmxMb}m -jar /opt/h2o.jar -nthreads ${nthreads} -name ${H2O_CLUSTER_NAME} -port ${H2O_NODE_2_PORT} &
nohup java -Xmx${xmxMb}m -jar /opt/h2o.jar -nthreads ${nthreads} -name ${H2O_CLUSTER_NAME} -port ${H2O_NODE_3_PORT} &
# Start the 3rd node in the foreground so the container keeps running.
java -Xmx${xmxMb}m -jar /opt/h2o.jar -nthreads ${nthreads} -name ${H2O_CLUSTER_NAME} -port ${H2O_NODE_1_PORT}
As can be seen, I start a total of three nodes (the webapp can request two operations at once), each on a different port (54321, 54323, and 54325, all on the same IP). I set each node's memory to 30% of the total memory and nthreads to a third of the available cores (36 in total, 12 per node). The cluster starts fine with 3 nodes; however, contrary to what I expected, each node gets all 36 cores instead of 12 (108 in total), as shown in the image below, which leads to the same error I had before.
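The intent is for each app to attach to its own node of that cluster, roughly like this (the hostname here is a placeholder for the H2O container's address):
import h2o
# Webapp container: attach to the first node.
h2o.init(ip="h2o-cluster", port=54321)
# ML container: attach to a different node of the same cluster,
# hoping a training job and a prediction request can run side by side.
h2o.init(ip="h2o-cluster", port=54323)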
I looked at other Stack Overflow posts as well as the H2O documentation, but couldn't find anything that works for me. How can I configure H2O so that multiple jobs from different applications can run simultaneously?