2 votes

I am trying to run a simple MPI program (multiple array addition). It runs perfectly on my PC but simply hangs or shows the following error on the cluster. I am using Open MPI and the command shown below to execute it.

Network config of the cluster (master & node1)

        MASTER

eth0      Link encap:Ethernet  HWaddr 00:22:19:A4:52:74
          inet addr:10.1.1.1  Bcast:10.1.255.255  Mask:255.255.0.0
          inet6 addr: fe80::222:19ff:fea4:5274/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:16914 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7183 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:2050581 (1.9 MiB)  TX bytes:981632 (958.6 KiB)

eth1      Link encap:Ethernet  HWaddr 00:22:19:A4:52:76
          inet addr:192.168.41.203  Bcast:192.168.41.255  Mask:255.255.255.0
          inet6 addr: fe80::222:19ff:fea4:5276/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:701 errors:0 dropped:0 overruns:0 frame:0
          TX packets:228 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:75457 (73.6 KiB)  TX bytes:25295 (24.7 KiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:88362 errors:0 dropped:0 overruns:0 frame:0
          TX packets:88362 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:21529504 (20.5 MiB)  TX bytes:21529504 (20.5 MiB)

peth0     Link encap:Ethernet  HWaddr 00:22:19:A4:52:74
          inet6 addr: fe80::222:19ff:fea4:5274/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:17175 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7257 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2373869 (2.2 MiB)  TX bytes:1020320 (996.4 KiB)
          Interrupt:16 Memory:da000000-da012800

peth1     Link encap:Ethernet  HWaddr 00:22:19:A4:52:76
          inet6 addr: fe80::222:19ff:fea4:5276/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1112 errors:0 dropped:0 overruns:0 frame:0
          TX packets:302 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:168837 (164.8 KiB)  TX bytes:33241 (32.4 KiB)
          Interrupt:16 Memory:d6000000-d6012800

virbr0    Link encap:Ethernet  HWaddr 52:54:00:E3:80:BC
          inet addr:192.168.122.1  Bcast:192.168.122.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

            NODE 1

eth0      Link encap:Ethernet  HWaddr 00:22:19:53:42:C6
          inet addr:10.1.255.253  Bcast:10.1.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:16559 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7299 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:1898811 (1.8 MiB)  TX bytes:1056294 (1.0 MiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:25 errors:0 dropped:0 overruns:0 frame:0
          TX packets:25 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:3114 (3.0 KiB)  TX bytes:3114 (3.0 KiB)

peth0     Link encap:Ethernet  HWaddr 00:22:19:53:42:C6
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:16913 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7276 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2221627 (2.1 MiB)  TX bytes:1076708 (1.0 MiB)
          Interrupt:16 Memory:f8000000-f8012800

virbr0    Link encap:Ethernet  HWaddr 52:54:00:E7:E5:FF
          inet addr:192.168.122.1  Bcast:192.168.122.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

Error

mpirun -machinefile machine -np 4 ./query
Error output:
[[22877,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.122.1 failed: Connection refused (111)

Code

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define group MPI_COMM_WORLD
#define root  0
#define size  100

int main(int argc, char *argv[])
{
    int no_tasks, task_id, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(group, &no_tasks);
    MPI_Comm_rank(group, &task_id);

    int arr1[size], arr2[size], local1[size], local2[size];

    /* Root initialises the two input arrays */
    if (task_id == root)
    {
        for (i = 0; i < size; i++)
        {
            arr1[i] = arr2[i] = i;
        }
    }

    /* Distribute equal chunks of both arrays to all ranks
       (assumes size is divisible by no_tasks) */
    MPI_Scatter(arr1, size/no_tasks, MPI_INT, local1, size/no_tasks, MPI_INT, root, group);
    MPI_Scatter(arr2, size/no_tasks, MPI_INT, local2, size/no_tasks, MPI_INT, root, group);

    /* Each rank adds its local chunks element-wise */
    for (i = 0; i < size/no_tasks; i++)
    {
        local1[i] += local2[i];
    }

    /* Collect the partial results back on the root */
    MPI_Gather(local1, size/no_tasks, MPI_INT, arr1, size/no_tasks, MPI_INT, root, group);

    if (task_id == root)
    {
        printf("The Array Sum Is\n");
        for (i = 0; i < size; i++)
        {
            printf("%d  ", arr1[i]);
        }
        printf("\n");
    }

    MPI_Finalize();
    return 0;
}
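For reference, I build the program with Open MPI's wrapper compiler and launch it as follows (assuming the source file is named query.c):

$ mpicc -o query query.c
$ mpirun -machinefile machine -np 4 ./query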
Connection refused comes from the network, so it sounds like the configuration of the cluster is the problem. – Fred
Possibly a firewall issue? Like Fred says, it's a network issue, not an MPI issue as such. – Mats Petersson
See my answer to this question. Your question might be a duplicate of that one. – Hristo Iliev
@Fred I have added the network config of the cluster I am working on. The master's and the first node's IPv6 addresses are on the same subnet; could that be a problem? – Justin Joseph
@MatsPetersson A single scatter works between the master and the slaves, so the firewall allows at least that access, and there is a passwordless SSH connection between them. – Justin Joseph

1 Answer

8 votes

Tell Open MPI not to use the virtual bridge interface virbr0 for sending messages over TCP/IP, or better, tell it to use only eth0 for that purpose:

$ mpiexec --mca btl_tcp_if_include eth0 ...
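If excluding the bridge is preferable, the same effect can be achieved with the exclusion list; note that the loopback interface should also be listed, since setting btl_tcp_if_exclude replaces the built-in default:

$ mpiexec --mca btl_tcp_if_exclude lo,virbr0 ...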

This comes from the greedy behaviour of Open MPI's tcp BTL component, which transmits messages over TCP/IP. It tries to use all available network interfaces that are up on each node in order to maximise the data bandwidth. Both nodes have virbr0 configured with the same subnet address. Open MPI fails to recognise that both addresses are in fact the same and, since the subnets match, assumes that it should be able to talk over virbr0. So process A tries to send a message to process B, which resides on the other node. Process B listens on port P and process A knows this, so it tries to connect to 192.168.122.1:P. But that is actually the address assigned to the virbr0 interface on the node where process A runs, so the node tries to talk to itself on a non-existent port, hence the "connection refused" error.
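If this restriction should apply permanently for your user on the cluster, the parameter can also be placed in the per-user MCA parameter file (assuming the default location $HOME/.openmpi/mca-params.conf), so it no longer needs to be passed on the command line:

btl_tcp_if_include = eth0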