I have a simple MPICH program in which processes send & receive messages from each other in a Ring order.
I've set up two identical virtual machines and made sure the network is working fine. I've tested a simple MPICH program on both machines, and it works fine.
The problem arises when I try to communicate between processes on different machines like the above program. I'm getting the following error:
Fatal error in MPI_Send: A process has failed, error stack:
MPI_Send(171)...............: MPI_Send(buf=0xbfed8c08, count=1, MPI_INT, dest=1, tag=1, MPI_COMM_WORLD) failed
MPID_nem_tcp_connpoll(1826): Communication error with rank 1: Connection refused
- SSH is passwordless & works fine on both sides.
- /etc/hosts is configured properly.
- Firewall is disabled on both machines.
- Configured NFS Client/Server and shared a directory between them. (According to this)
- Tried both MPICH & OpenMPI with Hydra
mpiexec -f hosts -n 4 ./myapp
which I think uses ssh under the hood. – atoMerz
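For reference, the file passed to Hydra's mpiexec with -f usually lists one machine per line, optionally followed by a colon and the number of processes to start there. The actual hosts file is not shown here, so the following is only a minimal sketch that assumes two machines with the hypothetical names vm1 and vm2:

```text
# Hypothetical host names; each must resolve to a reachable address on every machine
vm1:2
vm2:2
```

With a file like this, -n 4 would typically place two processes on each machine, so the ring has to cross the network between them.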