btl_tcp_endpoint...
This Error Reference is intended for:
Any audience.
Error message:
[c19-15.local][[18870,1],5][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.254.208 failed: Connection refused (111)
connect() to 192.168.255.238 failed: Connection refused (111)
[c43-16.local:11900] 7 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[c43-16.local:11900] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
This error is probably caused by a weakness in the system that shows up when a job is assigned a mix of nodes with and without InfiniBand.
Until this is fixed in general, you can avoid the problem by not using the InfiniBand network at all, even on the nodes where it is available. To do so, add the option "--mca btl ^openib --mca btl_tcp_if_include eth0" to mpirun.
For example:
mpirun --mca btl ^openib --mca btl_tcp_if_include eth0 -np 256 MyProg.exe
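As a minimal sketch, assuming the MPI library is Open MPI (which the btl_tcp_endpoint.c message suggests), the same two MCA parameters can also be set through environment variables named OMPI_MCA_<parameter> instead of on the mpirun command line:

```shell
#!/bin/bash
# Same effect as "--mca btl ^openib": exclude the openib (InfiniBand) transport.
export OMPI_MCA_btl='^openib'

# Same effect as "--mca btl_tcp_if_include eth0": use only eth0 for TCP traffic.
export OMPI_MCA_btl_tcp_if_include='eth0'

# With both variables exported, a plain mpirun line picks them up:
#   mpirun -np 256 MyProg.exe
```

This can be convenient in a job script that calls mpirun several times, since the options only have to be written once.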
Another alternative is to avoid being assigned InfiniBand nodes in the first place by using the PBS option ":gige", for example:
#PBS -lnodes=2:ppn=8:gige
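Put into a complete job script, the ":gige" request might look like the sketch below. It assumes a PBS/Torque environment that sets PBS_NODEFILE (one line per allocated slot) and PBS_O_WORKDIR inside a running job. The helper count_slots is a hypothetical name introduced only for this illustration, and MyProg.exe is the placeholder program from the example above.

```shell
#!/bin/bash
#PBS -lnodes=2:ppn=8:gige

# count_slots FILE: print the number of lines in a PBS node file.
# (Hypothetical helper, used here only to derive the -np value.)
count_slots() {
    wc -l < "$1" | tr -d ' '
}

# Start the program only when running inside a PBS job.
if [ -n "${PBS_NODEFILE:-}" ]; then
    # Go to the directory the job was submitted from.
    cd "$PBS_O_WORKDIR" || exit 1
    # One MPI process per allocated slot (assumption about the desired layout).
    mpirun -np "$(count_slots "$PBS_NODEFILE")" MyProg.exe
fi
```

Deriving the process count from the node file avoids hard-coding a number that has to be kept in sync with the -lnodes request.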
Remark
You should also consider using the InfiniBand network, since it may significantly improve the performance of your code. If you use only InfiniBand nodes, you will not get the "MPI_INIT" error.
You can request only InfiniBand nodes with the PBS option ":ib", for example:
#PBS -lnodes=2:ppn=8:ib
See also Run script example for Stallo


:et option?
In these cases, it does not appear to be possible to have only Ethernet-connected nodes assigned to the job. Would it be possible to have an option similar to ib that forces the job to use only Ethernet, and that is used in the same way in the -lnodes option?