3
votes

I've been at it for days but could not solve my problem.

I am running: mpiexec -hostfile ~/machines -nolocal -pernode mkdir -p $dstpath where $dstpath points to current directory and "machines" is a file containing: node01 node02 node03 node04

This is the error output:

Failed to parse XML input with the minimalistic parser. If it was not
generated by hwloc, try enabling full XML support with libxml2.
[node01:06177] [[6421,0],0] ORTE_ERROR_LOG: Error in file base/plm_base_launch_support.c at line 891
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
[node01:06177] 1 more process has sent help message help-errmgr-base.txt / failed-daemon-launch
[node01:06177] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Failed to parse XML input with the minimalistic parser. If it was not
generated by hwloc, try enabling full XML support with libxml2.
[node01:06181] [[6417,0],0] ORTE_ERROR_LOG: Error in file base/plm_base_launch_support.c at line 891

I have 4 machines, node01 to node04. In order to log into these 4 nodes, I have to first log in to node00. I am trying to run some distributed graph functions. The graph software is installed in node01 and is supposed to be synchronised to the other nodes using mpiexec.

What I've done:

  1. Made sure all passwordless login are setup, every machine can ssh to any other machine with no issues.

  2. Have a hostfile in the home directory.

  3. echo $PATH gives /home/myhome/bin:/home/myhome/.local/bin:/usr/include/openmpi:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin

  4. echo $LD_LIBRARY_PATH gives /usr/lib/openmpi/lib

This has previously worked before, but it just suddenly started giving these errors. I got my administrator to install fresh machines but it still gave such errors. I've tried doing it one node at a time but it gave the same errors. I'm not entirely familiar with command line at all so please give me some suggestions. I've tried reinstalling OpenMPI from source and from sudo apt-get install openmpi-bin. I'm on Ubuntu 16.04 LTS.

1
Did you ever get this resolved? I am having like the same exact problem. - Jon Deaton
I had the same problem, it disappeared when I use IP numbers instead of node names in the machinefile. Don't really get it, hostnames are defined in /etc/hosts on all the nodes. - mok0

1 Answers

-2
votes

You should focus on fixing:

Failed to parse XML input with the minimalistic parser. If it was not generated by hwloc, try enabling full XML support with libxml2. [node01:06177] [[6421,0],0] ORTE_ERROR_LOG: Error in file base/plm_base_launch_support.c at line 891