The problem with opening insecure ports
Note: this is a general problem, not one limited to Hadoop specifically.
The current solution of opening up ports is not a good idea. Hadoop status pages are served via HTTP (not HTTPS), so they travel in plain text, and once the ports are open, anyone on the Internet can reach your instance and view or take control of your Hadoop jobs, your cluster, or the data they contain.
Solution alternatives
Instead, you should send all traffic over an encrypted channel: either HTTPS/SSL, or an SSH tunnel that carries your browser traffic.
To my knowledge, Hadoop does not serve HTTPS at this time, so what you can do is create an SSH tunnel and browse through that secure tunnel.
The benefits of this solution are:
- it's secure: all communication between your browser and the VM instance travels over your SSH connection, so even though the pages are served over HTTP instead of HTTPS, the traffic is still protected from outside parties
- you can connect to hostnames (i.e., your VM names) directly, as if they were on your local network, e.g.,
http://my-host:5392
- you can connect to any port on any host, without having to open each and every port individually
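For comparison, plain SSH port forwarding needs one -L flag per host and port, which is the bookkeeping the SOCKS proxy avoids. The following is only a sketch: the helper name forward_flags, the host names, and the ports are placeholders chosen for illustration, not values from a real cluster.

```shell
#!/bin/bash
# Build one "-L local_port:host:remote_port" flag per target.
# Each argument is a "host:port" pair; the same port number is reused locally.
forward_flags() {
  local target port
  for target in "$@"; do
    port="${target##*:}"   # text after the last colon
    printf '%s %s:%s ' -L "${port}" "${target}"
  done
}

forward_flags my-host:5392 my-other-host:8080
# prints: -L 5392:my-host:5392 -L 8080:my-other-host:8080
```

With plain ssh, the output could be spliced into a command such as `ssh -N $(forward_flags my-host:5392) user@gateway` (user@gateway is hypothetical). Every additional status page means another flag, whereas a single SOCKS proxy covers them all.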
Complete guide to connecting securely to GCE VMs
See the "Securely Connecting to VM Instances" guide for more details beyond the SOCKS proxy guide below, including firewalls, HTTPS and SSL, port forwarding over SSH, SOCKS proxy over SSH, bastion hosts, VPNs, NATs, etc.
Connecting securely via SSH tunnel + SOCKS proxy
The way to do this is to set up a SOCKS proxy which will use an SSH tunnel to secure your communication with the Hadoop cluster on GCE. You can either use the full script or create your own as follows:
#!/bin/bash
# Modify these variables to match your deployment.
export PROJECT="curious-lemming-42" # Google Cloud Platform Project
export ZONE="us-central1-a" # zone of Hadoop cluster
export PORT="9000" # port on local machine to run proxy;
# just choose an open port
export SERVER="my-instance" # any VM instance in the cluster
# This command starts the SOCKS proxy on $PORT.
gcloud compute ssh \
    --project="${PROJECT}" \
    --zone="${ZONE}" \
    --ssh-flag="-D" \
    --ssh-flag="${PORT}" \
    --ssh-flag="-N" \
    "${SERVER}"
Open a new terminal on your local machine (not on a GCE VM) and run this script there. While this script is running, you will have a secure proxy set up to your Hadoop cluster over SSH.
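Before setting up a browser, you may want to confirm that the proxy actually forwards requests. A minimal sketch using curl, assuming curl is installed locally; the function names are made up for this example, and my-host:5392 is the placeholder URL from above.

```shell
#!/bin/bash
# This port must match the port in the proxy script.
PORT="${PORT:-9000}"

# socks5h:// tells curl to let the proxy side resolve the host name,
# so VM names that only exist inside the cluster network still work.
proxy_url() {
  echo "socks5h://localhost:${1}"
}

# Returns 0 if the page at URL $1 could be fetched through the proxy.
check_via_proxy() {
  curl --silent --show-error --fail --output /dev/null \
    --proxy "$(proxy_url "${PORT}")" "${1}"
}

# Usage, while the proxy script is running:
#   check_via_proxy "http://my-host:5392" && echo "proxy works"
```

If the check fails, curl's error message usually indicates whether the proxy itself is unreachable or the host behind it did not answer.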
Then, assuming you're using Google Chrome, you can use the following script to connect securely to your Hadoop cluster. Run it on your local machine as well, not on a GCE VM:
#!/bin/bash
# This port must match the port in the other script above.
declare -r PORT="9000"
# Create a directory for the proxy profile to separate it from the others.
# You can change this directory if you wish.
declare -r CHROME_PROXY_PROFILE="${HOME}/chrome-proxy-profile"
if ! [ -d "${CHROME_PROXY_PROFILE}" ]; then
    mkdir -p "${CHROME_PROXY_PROFILE}"
fi
# Run a new instance of Chrome using the custom proxy profile.
declare -r OS_NAME="$(uname -s)"
if [[ "${OS_NAME}" == "Linux" ]]; then
    /usr/bin/google-chrome \
        --user-data-dir="${CHROME_PROXY_PROFILE}" \
        --proxy-server="socks5://localhost:${PORT}"
elif [[ "${OS_NAME}" == "Darwin" ]]; then
    "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" \
        --user-data-dir="${CHROME_PROXY_PROFILE}" \
        --proxy-server="socks5://localhost:${PORT}"
else
    echo "Unrecognized OS: ${OS_NAME}" >&2
    exit 1
fi
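Since the Chrome script only works while the proxy script is running, it can help to check the port first. The following is a sketch, not part of the original scripts: port_is_open is a hypothetical helper, and it relies on bash's /dev/tcp redirection, so it assumes bash rather than a plain POSIX shell.

```shell
#!/bin/bash
# Succeeds if something accepts TCP connections on localhost:$1.
port_is_open() {
  # The subshell opens the connection and closes it again on exit.
  (exec 3<>"/dev/tcp/localhost/${1}") 2>/dev/null
}

# This port must match the port in the proxy script.
PORT="9000"
if port_is_open "${PORT}"; then
  echo "Proxy is listening on port ${PORT}."
else
  echo "Nothing is listening on port ${PORT}; start the proxy script first." >&2
fi
```

This only shows that some process is listening on the port; it does not prove that the listener is the SOCKS proxy.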
If you would like to use Firefox instead, see these directions; the Firefox setup cannot be scripted at this time.
Background and details on how and why this works
You can read more about SSH tunneling, what it is and how it works from these sources: