My MPI job crashes with a "retry limit exceeded" error.
If you get an error like this from an MPI job:
[0,1,132][btl_openib_component.c:1328:btl_openib_component_progress] from c12-3 to: c13-3 error polling HP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 57073528 opcode 42
This means that communication over the InfiniBand fabric between two nodes has failed. The cause may or may not be a fault in the InfiniBand network itself.
To work around this problem, try to avoid the problematic node(s). In the error message above, the receiver, c13-3, appears to be causing the problem, so in this case you could run a dummy job on that node, so that your real job is not placed there, and then resubmit the failing job. See here how to run a dummy job on a specific node.
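If the job output contains many such lines, it can help to tally which receiver nodes appear most often before deciding which node to avoid. The following is a minimal sketch in Python, not a site-provided tool. The regular expression is modelled only on the single sample line above, so other Open MPI versions may need a different pattern, and the function names are hypothetical.

```python
import re
from collections import Counter

# Pattern based on the one sample line in this FAQ entry (an assumption);
# adjust it if your Open MPI version formats the message differently.
RETRY_RE = re.compile(
    r"from\s+(?P<sender>\S+)\s+to:\s+(?P<receiver>\S+)\s+"
    r"error polling .*RETRY EXCEEDED ERROR"
)


def find_retry_errors(log_text):
    """Return a (sender, receiver) pair for each retry-exceeded line."""
    pairs = []
    for line in log_text.splitlines():
        match = RETRY_RE.search(line)
        if match:
            pairs.append((match.group("sender"), match.group("receiver")))
    return pairs


def count_receivers(pairs):
    """Count how often each node appears as the receiver."""
    return Counter(receiver for _, receiver in pairs)


if __name__ == "__main__":
    sample = (
        "[0,1,132][btl_openib_component.c:1328:btl_openib_component_progress] "
        "from c12-3 to: c13-3 error polling HP CQ with status RETRY EXCEEDED "
        "ERROR status number 12 for wr_id 57073528 opcode 42"
    )
    # Print receiver nodes, most frequently reported first.
    for node, hits in count_receivers(find_retry_errors(sample)).most_common():
        print(node, hits)
```

A node that shows up repeatedly as the receiver is a reasonable first candidate to avoid, although a single line, as in the example above, is not conclusive.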
If this does not help, send us a problem report.

