btl_tcp_endpoint...


This Error Reference is intended for: Any audience.

The program crashes at startup with a message containing "...btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect...". This can happen when your job has been assigned a mix of nodes with and without InfiniBand.

Error message:

[c19-15.local][[18870,1],5][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect]
connect() to 192.168.254.208 failed: Connection refused (111)
connect() to 192.168.255.238 failed: Connection refused (111)
[c43-16.local:11900] 7 more processes have sent help message
help-mpi-btl-base.txt / btl:no-nics
[c43-16.local:11900] Set MCA parameter "orte_base_help_aggregate" to 0
to see all help / error messages
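To see at a glance which addresses refused the connection, the saved job output can be filtered for the connect() lines. The following is only a sketch: the helper name refused_addresses and the file name job.out are hypothetical, and it assumes the failures are logged in the same form as in the message above.

```shell
# Hypothetical helper: print each distinct address named in a
# "connect() to <address> failed" line of a saved job output file.
refused_addresses() {
    # $1: path to the saved job output (assumed to exist)
    grep -o 'connect() to [0-9.]* failed' "$1" | awk '{ print $3 }' | sort -u
}

# Example call, assuming the job output was saved as job.out:
# refused_addresses job.out
```

Each address is printed once, however many processes reported it.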

The error is probably caused by a weakness in the system that shows up when a job is assigned nodes with and without InfiniBand at the same time.

Until this is fixed in general, you can avoid the problem by not using the InfiniBand network at all, even on the nodes where it is available. To do so, add the options "--mca btl ^openib --mca btl_tcp_if_include eth0" to mpirun.

For example:

mpirun  --mca btl ^openib --mca btl_tcp_if_include eth0 -np 256 MyProg.exe
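The same two parameters can also be supplied through the environment, because Open MPI reads any variable named OMPI_MCA_<parameter> as the corresponding MCA parameter. A minimal sketch, assuming a Bourne-style shell and the same eth0 interface and MyProg.exe as in the example above; the mpirun line is left commented out so the fragment can be sourced without starting a job:

```shell
# Environment equivalent of: --mca btl ^openib --mca btl_tcp_if_include eth0
export OMPI_MCA_btl='^openib'               # exclude the InfiniBand (openib) transport
export OMPI_MCA_btl_tcp_if_include='eth0'   # restrict TCP traffic to eth0, as above

# mpirun -np 256 MyProg.exe
```

The quotes around ^openib keep shells that treat ^ specially from interpreting it.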

Alternatively, you can avoid being assigned InfiniBand nodes altogether by requesting the PBS node property ":gige", for example:

#PBS -lnodes=2:ppn=8:gige

Remark

Consider using the InfiniBand network where you can, since it may significantly improve the performance of your code. If your job runs only on InfiniBand nodes, you will not get the "MPI_INIT" error.

You can request only InfiniBand nodes with the PBS option ":ib", for example:

#PBS -lnodes=2:ppn=8:ib

  See also Run script example for Stallo

by Peter Wind last modified Feb 16, 2010 10:50 PM Notur

:et option?

Posted by Maxime Guillaume at Dec 02, 2008 08:46 PM
In some cases, the mpiexec command is managed by the program itself through parallelized scripts (Turbomole, for example, decides by itself how to launch its different executables).

In these cases, it does not appear to be possible to assign only Ethernet-connected nodes to the job. Would it be possible to have an option similar to ib that forces the job to use only Ethernet, and that is used in the same way in the -lnodes option?


:gige

Posted by Peter Wind at Dec 03, 2008 01:40 PM
You can use ":gige". I have included your remark in the main text.
We are also working on a more permanent and user-friendly solution.

Setting Open MPI parameters through the environment

Posted by Roy Dragseth at Dec 10, 2008 02:17 PM
If you cannot get direct access to the mpirun command line, you can use environment variables instead. In this case, to turn off InfiniBand:

export OMPI_MCA_btl=self,sm,tcp

mpirun ............

This should work unless the wrapper scripts completely wipe the environment.