
My MPI job crashes with a retry limit exceeded error.


Sometimes an MPI job will fail, saying that its retry limit is exceeded.

If you get something like this from an MPI job:

[0,1,132][btl_openib_component.c:1328:btl_openib_component_progress] from c12-3 to: c13-3 error polling
HP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id   57073528 opcode 42

This means that you have hit a problem with the communication over the InfiniBand fabric between two nodes. The cause may or may not be the InfiniBand network itself.

To work around this problem, try to avoid the problematic node(s). In the error message above, the receiver, c13-3, seems to be causing the problem, so in this case I would run a dummy job on that node to keep it occupied and then resubmit the failing job. See here how to run a dummy job on a specific node.
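If a job log contains many such messages, it can help to count which node pairs show up most often. The following is only a minimal sketch: the regular expression is derived from the single sample message above, and other Open MPI versions may format the line differently. The function name find_suspect_nodes is hypothetical and not part of any Stallo tool.

```python
import re
from collections import Counter

# Assumption: the pattern is modelled only on the sample error message;
# adjust it if your Open MPI version prints the line differently.
PATTERN = re.compile(r"from\s+(\S+)\s+to:\s+(\S+)\s+error polling")


def find_suspect_nodes(log_text):
    """Count how often each (sender, receiver) pair appears in the log."""
    return Counter(PATTERN.findall(log_text))


# The sample message from this FAQ entry, used as test input.
sample = (
    "[0,1,132][btl_openib_component.c:1328:btl_openib_component_progress] "
    "from c12-3 to: c13-3 error polling\n"
    "HP CQ with status RETRY EXCEEDED ERROR status number 12 "
    "for wr_id   57073528 opcode 42\n"
)

if __name__ == "__main__":
    for (sender, receiver), count in find_suspect_nodes(sample).items():
        print(f"sender={sender} receiver={receiver} occurrences={count}")
```

A node that appears as receiver in many pairs is a reasonable first candidate to block with a dummy job.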

If this does not help, send us a problem report.

by Roy Dragseth last modified May 28, 2008 09:12 PM Notur