GASPI is supposed to be failure tolerant. What does this entail?

GASPI is failure tolerant in the sense that every non-local operation takes a timeout and returns a defined exit status. GASPI also maintains a status vector that holds the current node status. If an error occurs and a node is lost, the application can either reduce the GASPI node set or request a new node, e.g. from a pool of spare nodes. Application-level steps are nevertheless necessary, e.g. re-initializing from a previously written checkpoint. GASPI does not handle failures fully automatically, for example via built-in checkpoint/restart; it provides only the minimal set of low-level functions required to build such handling.
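To make the division of labour concrete, here is a minimal sketch of the decision an application might take after a non-local operation returns. It deliberately does not call the GASPI library: the types `op_status`, `node_state` and `recovery_action`, and the functions `count_failed` and `choose_recovery`, are hypothetical stand-ins for an exit status, the status vector and an application-defined policy. The policy itself (retry if no node is reported lost, use spares while enough are left, shrink otherwise) is an assumption for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT GASPI types or constants. */
typedef enum { OP_SUCCESS, OP_TIMEOUT, OP_ERROR } op_status;
typedef enum { NODE_HEALTHY, NODE_CORRUPT } node_state;
typedef enum { ACT_CONTINUE, ACT_RETRY, ACT_USE_SPARE, ACT_SHRINK } recovery_action;

/* Count the entries of the status vector that are marked as lost. */
size_t count_failed(const node_state *state_vec, size_t n)
{
    size_t failed = 0;
    for (size_t i = 0; i < n; ++i) {
        if (state_vec[i] == NODE_CORRUPT) {
            ++failed;
        }
    }
    return failed;
}

/* Decide what the application does after a non-local operation returned.
 * Assumed policy: continue on success, retry if no node is reported lost,
 * replace lost nodes while enough spares remain, otherwise shrink. */
recovery_action choose_recovery(op_status status,
                                const node_state *state_vec, size_t n,
                                size_t spares_left)
{
    assert(state_vec != NULL && n > 0);

    if (status == OP_SUCCESS) {
        return ACT_CONTINUE;
    }

    size_t failed = count_failed(state_vec, n);
    if (failed == 0) {
        return ACT_RETRY;      /* timeout or error, but no node reported lost */
    }
    if (failed <= spares_left) {
        return ACT_USE_SPARE;  /* request replacement nodes from the pool */
    }
    return ACT_SHRINK;         /* continue on the reduced node set */
}
```

In both the `ACT_USE_SPARE` and the `ACT_SHRINK` case the application would still have to restore its own state, e.g. from a checkpoint, before continuing; that part is not sketched here.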