OpenShift 3.11 StorageOS networking issue

Question

I've created an OpenShift 3.11 cluster with 3 nodes, 2 of which are compute nodes, and installed StorageOS on it. One compute node is fine with the StorageOS installation, but the 2nd compute node can't reach the 1st. The error looks routing related: the 2nd node apparently has no route to the 1st.

[root@cortado-o1 standard]# oc get pod -n storageos
NAME              READY     STATUS    RESTARTS   AGE
storageos-47qgc   1/1       Running   0          6m
storageos-6bqqp   0/1       Running   3          7m

[root@cortado-o2 ~]# netstat -na | grep 5705
tcp6       0      0 :::5705     

[root@cortado-o3 ~]# netstat -na | grep 5705
tcp        0      0 192.168.0.101:43588     192.168.0.101:5705      TIME_WAIT  
tcp        0      0 192.168.0.101:43548     192.168.0.101:5705      TIME_WAIT  
tcp        0      0 192.168.0.101:43522     192.168.0.101:5705      TIME_WAIT  
tcp        0      0 192.168.0.101:43458     192.168.0.101:5705      TIME_WAIT  
tcp        0      0 192.168.0.101:43628     192.168.0.101:5705      TIME_WAIT  
tcp        0      0 192.168.0.101:43602     192.168.0.101:5705      TIME_WAIT  
tcp        0      0 192.168.0.101:43562     192.168.0.101:5705      TIME_WAIT  
tcp        0      0 192.168.0.101:43502     192.168.0.101:5705      TIME_WAIT  
tcp        0      0 192.168.0.101:43476     192.168.0.101:5705      TIME_WAIT  
tcp        0      0 192.168.0.101:43412     192.168.0.101:5705      TIME_WAIT  
tcp        0      0 192.168.0.101:43430     192.168.0.101:5705      TIME_WAIT  
tcp6       0      0 :::5705                 :::*                    LISTEN   

[root@cortado-o3 ~]# !nc
nc 192.168.0.102 5705
Ncat: No route to host.
[root@cortado-o3 ~]# hostname --ip-address
192.168.0.101

time="2018-11-13T04:24:38Z" level=error msg="failed to join existing cluster" action=create category=etcd endpoint="192.168.0.102,192.168.0.101" error="Get http://192.168.0.102:5705/v1/members: dial tcp 192.168.0.102:5705: connect: no route to host" module=cp
time="2018-11-13T04:24:38Z" level=info msg="not first cluster node, joining first node" action=create address=192.168.0.101 category=etcd host=cortado-o3 module=cp target=192.168.0.101
time="2018-11-13T04:24:38Z" level=error msg="failed to join existing cluster" action=create category=etcd endpoint="192.168.0.102,192.168.0.101" error="503 Service Unavailable" module=cp
time="2018-11-13T04:24:38Z" level=info msg="retrying cluster join in 5 seconds..." action=create category=etcd module=cp

Any suggestions? Many thanks.

Ferran Arau Castell · Accepted Answer · 2018-11-23T14:58:42

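Your netstat output shows that StorageOS is bound to the port on each node, not that the nodes can communicate. In fact, Ncat reports "No route to host", so they can't connect. Between hosts on the same subnet, that error often comes from a firewall rejecting the connection rather than from a missing route. StorageOS needs to be able to communicate between all of its nodes.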
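
To confirm the connectivity problem before and after changing any firewall rules, something like the following can be run on each node. This is only a sketch: the function names check_port and report_peers are made up for illustration, port 5705 is the one from your logs, and it assumes bash and the coreutils timeout command are installed. It uses bash's /dev/tcp instead of nc so it does not depend on which nc variant the host ships.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of StorageOS.

# Try to open a TCP connection using bash's built-in /dev/tcp redirection.
# Returns 0 if the connection opens within 3 seconds, non-zero otherwise.
check_port() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Print one line per peer saying whether port 5705 is reachable from here.
report_peers() {
    for peer in "$@"; do
        if check_port "$peer" 5705; then
            echo "$peer:5705 reachable"
        else
            echo "$peer:5705 NOT reachable"
        fi
    done
}

# Example: run on each node, passing the addresses of the other nodes.
# report_peers 192.168.0.101 192.168.0.102
```

Every node should report every other node as reachable once the ports are open.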

The StorageOS docs list the ports it requires and how to open them: https://docs.storageos.com/docs/prerequisites/firewalls

Which commands to use depends on whether your OpenShift installation uses ufw, firewalld, or plain iptables.
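
If you are not sure which one a node is running, a rough check like this may help. It is a sketch, not an exhaustive detection: detect_firewall is a hypothetical helper name, it should be run as root, and it only looks for the three tools mentioned here.

```shell
# Hypothetical helper: print which firewall front end appears to be active.
detect_firewall() {
    if command -v firewall-cmd >/dev/null 2>&1 && firewall-cmd --state >/dev/null 2>&1; then
        # firewall-cmd --state exits 0 only when the firewalld daemon is running
        echo firewalld
    elif command -v ufw >/dev/null 2>&1 && ufw status 2>/dev/null | grep -q '^Status: active'; then
        echo ufw
    elif command -v iptables >/dev/null 2>&1; then
        # No front end active; rules, if any, are managed with iptables directly
        echo iptables
    else
        echo none
    fi
}

detect_firewall
```

Run it on every node, since the nodes are not guaranteed to be configured the same way.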

For ufw try this:

ufw default allow outgoing
ufw allow 5701:5711/tcp
ufw allow 5711/udp

For firewalld try this:

firewall-cmd --permanent  --new-service=storageos
firewall-cmd --permanent  --service=storageos --add-port=5700-5800/tcp
firewall-cmd --add-service=storageos  --zone=public --permanent
firewall-cmd --reload

For straight iptables:

# Inbound traffic
iptables -I INPUT -i lo -m comment --comment 'Permit loopback traffic' -j ACCEPT
iptables -I INPUT -m state --state ESTABLISHED,RELATED -m comment --comment 'Permit established traffic' -j ACCEPT
iptables -A INPUT -p tcp --dport 5701:5711 -m comment --comment 'StorageOS' -j ACCEPT
iptables -A INPUT -p udp --dport 5711 -m comment --comment 'StorageOS' -j ACCEPT

# Outbound traffic
iptables -I OUTPUT -o lo -m comment --comment 'Permit loopback traffic' -j ACCEPT
iptables -I OUTPUT -d 0.0.0.0/0 -m comment --comment 'Permit outbound traffic' -j ACCEPT

Also check the StorageOS troubleshooting page for this particular issue: https://docs.storageos.com/docs/platforms/openshift/troubleshoot/install#peer-discovery---networking

In addition, a cluster with fewer than 3 nodes is not supported. You can run 1 node for testing, or 3 or more. With 2 nodes it is impossible to ensure quorum in a distributed environment, unless you point the StorageOS KV store at an external etcd.
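
The quorum point can be illustrated with a little arithmetic. This sketch assumes the embedded KV store needs a simple majority of members, as etcd does; the function names are made up for illustration.

```shell
# Majority quorum: more than half of the members must be available.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail while a majority is still available.
tolerated_failures() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3; do
    echo "nodes=$n quorum=$(quorum "$n") tolerated_failures=$(tolerated_failures "$n")"
done
# nodes=1 quorum=1 tolerated_failures=0
# nodes=2 quorum=2 tolerated_failures=0
# nodes=3 quorum=2 tolerated_failures=1
```

With 2 nodes both must be up for the cluster to have a majority, so the second node adds no fault tolerance over a single node; 3 is the smallest size that survives the loss of one member.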
