Remote access to HDFS on Kubernetes

Question

I am trying to setup HDFS on minikube (for now) and later on a DEV kubernetes cluster so I can use it with Spark. I want Spark to run locally on my machine so I can run in debug mode during development so it should have access to my HDFS on K8s.

I have already set up 1 namenode deployment and a datanode statefulset (3 replicas) and those work fine when I am using HDFS from within the cluster. I am using a headless service for the datanodes and a cluster-ip service for the namenode.

The problem starts when I am trying to expose hdfs. I was thinking of using an ingress for that but that only exposes port 80 outside of the cluster and maps paths to different services inside the cluster which is not what I'm looking for. As far as I understand, my local spark jobs (or hdfs client) talk to the namenode which replies with an address for each block of data. That address though is something like 172.17.0.x:50010 and of course my local machine can't see those.

Is there any way I make this work? Thanks in advance!

Might help if you set dfs.client.use.datanode.hostname to true. Then you'd at least maybe get a named pod address rather than some floating IP — OneCricketeer
That's already set but I still get the IP outside of the cluster. But even if I got the hostname, any idea how I could use it outside of the cluster? I still wouldn't be able to access hostname:50010 — mousi
You would need some ingress/load balancer proxy, and setup external clients to know how to resolve the internal pod addresses. It's not specifically a Hadoop problem — OneCricketeer

Rico Rico · Accepted Answer · 2018-11-04T04:18:03

I know this question is about just getting it to run in a dev environment, but HDFS is very much a work in progress on K8s, so I wouldn't by any means run it in production (as of this writing). It's quite tricky to get it working on a container orchestration system because:

You are talking about a lot of data and a lot of nodes (namenodes/datanodes) that are not meant to start/stop in different places in your cluster.
You have the risk of having a constantly unbalanced cluster if you are not pinning your namenodes/datanodes to a K8s node (which defeats the purpose of having a container orchestration system)
If you run your namenodes in HA mode and it for any reason your namenodes die and restart you run the risk of corrupting the namenode metadata which would make you lose all your data. It's also risky if you have a single node and you don't pin it to a K8s node.
You can't scale up and down easily without running in an unbalanced cluster. Running an unbalanced cluster defeats one of the main purposes of HDFS.

If you look at DC/OS they were able to make it work on their platform, so that may give you some guidance.

In K8s you basically need to create services for all your namenode ports and all your datanode ports. Your client needs to be able to find every namenode and datanode so that it can read/write from them. Also the some ports cannot go through an Ingress because they are layer 4 ports (TCP) for example the IPC port 8020 on the namenode and 50020 on the datanodes.

Hope it helps!

Remote access to HDFS on Kubernetes

1 Answers