
I have two Docker containers running my webapp and my machine learning application, both using H2O. Initially, both called h2o.init() pointing to the same IP:PORT, so a single one-node H2O cluster was initialized.
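Roughly, both apps were doing the equivalent of the following (a sketch; the IP and port are illustrative):

import h2o

# Both containers pointed at the same address, so they attached to
# (or started) the same single-node cluster; values are illustrative.
h2o.init(ip="172.17.0.2", port=54321)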

Consider that I have one model already trained and I'm now training a second one. During this training, if the webapp made a call to the H2O cluster (e.g., requesting a prediction from the first model), it would kill the training process (error message below), which was unintended. I tried setting a different port for each app, but the same situation kept occurring. I don't understand why, since I thought that by setting two different ports, two independent clusters would be initialized and therefore two jobs could run simultaneously.

Error message

Job request failed Server error java.lang.IllegalArgumentException:
      Error: Job is missing
      Request: GET /3/Jobs/$0301c0a8f00232d4ffffffff$_911222b9c2e4404c31191c0d3ffd44c6, will retry after 3s.
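The different-port attempt looked roughly like this (a sketch; the host and ports are illustrative):

import h2o

# webapp container
h2o.init(ip="h2o-host", port=54321)

# ML container (a separate process)
# h2o.init(ip="h2o-host", port=54322)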

Alternatively, I moved H2O to a container of its own and I'm trying to set up a multi-node cluster so that each app runs on its own node. Below are the Dockerfile and entrypoint.sh used to start the cluster:

Dockerfile

########################################################################
# Dockerfile for OpenJDK 8 on Ubuntu 16.04
########################################################################

# pull base image
FROM ubuntu:16.04

RUN \
    echo 'DPkg::Post-Invoke {"/bin/rm -f /var/cache/apt/archives/*.deb || true";};' | tee /etc/apt/apt.conf.d/no-cache && \
    echo "deb http://mirror.math.princeton.edu/pub/ubuntu xenial main universe" >> /etc/apt/sources.list && \
    apt-get update -q -y && \
    apt-get dist-upgrade -y && \
    apt-get clean && \
    rm -rf /var/cache/apt/* && \
    DEBIAN_FRONTEND=noninteractive apt-get install -y wget unzip openjdk-8-jdk python-pip python-sklearn python-pandas python-numpy python-matplotlib software-properties-common python-software-properties && \
    apt-get clean

# Fetch h2o
ENV H2O_RELEASE rel-zipf
ENV H2O_VERSION 3.32.1.7
RUN \
    wget http://h2o-release.s3.amazonaws.com/h2o/${H2O_RELEASE}/$(echo $H2O_VERSION | cut -d "." -f4)/h2o-${H2O_VERSION}.zip -O /opt/h2o.zip && \
    unzip -d /opt /opt/h2o.zip && \
    rm /opt/h2o.zip && \
    cd /opt && \
    cd `find . -name 'h2o.jar' | sed 's/.\///;s/\/h2o.jar//g'` && \
    cp h2o.jar /opt && \
    /usr/bin/pip install `find . -name "*.whl"`

# Define the working directory
WORKDIR \
    /home/h2o

EXPOSE 54321-54326

# Define entrypoint
COPY ./bin/entrypoint.sh ./entrypoint.sh
RUN chmod +x entrypoint.sh
ENTRYPOINT ["./entrypoint.sh"]

entrypoint.sh

#!/bin/bash
# Entrypoint script.

set -e

d=`dirname $0`

# Use 90% of RAM for H2O, 30% for each node.
memTotalKb=`cat /proc/meminfo | grep MemTotal | sed 's/MemTotal:[ \t]*//' | sed 's/ kB//'`
memTotalMb=$[ $memTotalKb / 1024 ]
tmp=$[ $memTotalMb * 30 ]
xmxMb=$[ $tmp / 100 ]

# Use all 36 cores for H2O, 12 for each node.
totalCores=`lscpu | grep "^CPU(s)" | sed 's/CPU(s):[ \t]*//'`
nthreads=$[ $totalCores / 3 ]

# First try running java.
java -version

# Start 2 H2O nodes in the background
nohup java -Xmx${xmxMb}m -jar /opt/h2o.jar -nthreads ${nthreads} -name ${H2O_CLUSTER_NAME} -port ${H2O_NODE_2_PORT} &
nohup java -Xmx${xmxMb}m -jar /opt/h2o.jar -nthreads ${nthreads} -name ${H2O_CLUSTER_NAME} -port ${H2O_NODE_3_PORT} & 

# Start the 3rd node in the foreground to keep the container alive.
java -Xmx${xmxMb}m -jar /opt/h2o.jar -nthreads ${nthreads} -name ${H2O_CLUSTER_NAME} -port ${H2O_NODE_1_PORT}

As can be seen, I start a total of three nodes (the webapp can request two operations at once), each on a different port (54321, 54323, and 54325; the IP is the same). I set each node's memory to 30% of total memory and its nthreads to a third of the available cores (36 total, 12 per node). The cluster starts fine with 3 nodes; however, contrary to what I expected, each node reports all 36 cores instead of 12 (108 in total), as shown in the image below, leading to the same error I had before.

H2O 3-node cluster
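For reference, what each node reports can be inspected from Python along these lines (a sketch; the host and port are assumptions):

import h2o

# Attach to one of the nodes and print the cluster summary, which
# lists every node with its reported cores and memory. Host/port assumed.
h2o.connect(ip="localhost", port=54321)
h2o.cluster().show_status()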

I looked at other Stack Overflow posts as well as the H2O documentation but couldn't find anything that works for me. How can I configure H2O so that multiple jobs can run simultaneously from different applications?


1 Answer


If you want to launch three independent single-node H2O clusters via the CLI, give them different names:

-name H2O_CLUSTER_NAME_1

-name H2O_CLUSTER_NAME_2

-name H2O_CLUSTER_NAME_3

If you give them the same name, they will try to form a single cluster. See here.
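Each application can then attach to its own single-node cluster, for example (a sketch; the host and ports are assumptions):

import h2o

# webapp -> cluster H2O_CLUSTER_NAME_1 (host/port are assumptions)
h2o.connect(ip="h2o-host", port=54321)

# ML app, in its own process -> cluster H2O_CLUSTER_NAME_2
# h2o.connect(ip="h2o-host", port=54323)

Since the clusters are independent, a job running in one cannot kill a job in the other.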