GASPI/GPI-2 In-Memory Checkpointing

The GPI Checkpointing Library (GPI CP) is an open-source in-memory checkpointing library. It is available on GitHub.

In-memory checkpointing 

  • no I/O overhead
  • benefits from fast, low-latency transport

Playing to full strength with NVRAM

  • usage of NVRAM automatically includes persistence (non-volatility)
  • multiple parallel target segments (DRAM, NVRAM, ...) possible

Interface independent of GASPI/GPI-2

  • can be called from any parallel program, e.g. one based on MPI

Implementation based on GASPI/GPI-2

  • leveraging fault tolerance features of GASPI/GPI-2
  • exploiting asynchrony to overlap communication and computation time

Proven concept

  • overhead negligible for a real-life seismic imaging method
  • substantial benefit when recovering from failure

Introduction

Checkpointing is a classical and probably the most often used technique to minimize the effect of failures when running a parallel program. It simply consists of saving a snapshot of a program’s state or produced data. An application can use such a snapshot to recover from a failure by continuing the execution from the saved point. To avoid the large I/O overhead of writing to persistent storage, an often-cited drawback of checkpointing, we opted for in-memory checkpointing, where the snapshot of a process is saved in the memory of a neighboring node. In case of failure, a spare node can fetch the checkpoint from the neighbor of the failing process (the mirror) and continue the work from that point. Another important aspect, and a consequence of the GASPI architecture, is that when NVRAM becomes widely available, such an approach includes persistence (non-volatility) automatically.
GASPI memory segments were conceived to represent any sort of available memory, and NVRAM is one such sort. Our approach provides a simple interface for application developers, built on top of GASPI/GPI. This keeps the GASPI/GPI core lean and checkpointing available as a separate option.
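As an illustration, a checkpoint target segment can be an ordinary DRAM segment allocated by GPI-2, or user-managed memory such as a memory-mapped NVRAM region handed to GPI-2 as a segment. The following is only a minimal sketch: it assumes the gaspi_segment_bind call of the GASPI segment API, uses a placeholder path for the NVRAM-backed file, and reuses the SUCCESS_OR_DIE convenience macro from the application example further below; depending on how the checkpoint library accesses the segment, an additional registration step for remote access may be required.

#include <GASPI.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Checkpoint target segment backed either by DRAM or by a user-managed
// NVRAM mapping. The file path and the memory description are placeholders.
void setup_checkpoint_segment (gaspi_segment_id_t const seg_id,
                               gaspi_size_t const size,
                               int const use_nvram)
{
    if (!use_nvram)
    {
        // plain DRAM segment, allocated and registered by GPI-2
        SUCCESS_OR_DIE (gaspi_segment_create (seg_id, size, GASPI_GROUP_ALL,
                                              GASPI_BLOCK, GASPI_MEM_INITIALIZED));
    }
    else
    {
        // map an NVRAM-backed file and hand the memory to GPI-2 as a segment;
        // checkpoints stored here are persistent (non-volatile)
        int const fd = open ("/mnt/pmem/checkpoint.seg", O_RDWR | O_CREAT, 0600);
        ftruncate (fd, (off_t) size);
        void *ptr = mmap (NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        SUCCESS_OR_DIE (gaspi_segment_bind (seg_id, ptr, size, 0));
    }
}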

Interface

The current interface (subject to additions and changes) consists of calls to initialize and finalize the checkpointing infrastructure and to perform and restore a checkpoint. You can view it on GitHub.
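The authoritative prototypes are defined in the repository; the declarations below are only a sketch inferred from the calls used in the application example that follows, so the policy type and the parameter names are assumptions.

#include <GASPI.h>

// Sketch of the GPI CP calls as they appear in the example below.
// gaspi_cp_policy_t is a placeholder name for the policy type.
typedef int gaspi_cp_policy_t;
typedef struct gaspi_cp_description gaspi_cp_description_t;

// set up the checkpoint infrastructure for a segment region, a communication
// queue, a mirror policy and the group of active ranks
gaspi_return_t gpi_cp_init (gaspi_segment_id_t segment_id,
                            gaspi_offset_t offset,
                            gaspi_size_t size,
                            gaspi_queue_id_t queue,
                            gaspi_cp_policy_t policy,
                            gaspi_group_t group,
                            gaspi_cp_description_t *description,
                            gaspi_timeout_t timeout);

// start the asynchronous transfer of the checkpoint data to the mirror
gaspi_return_t gpi_cp_start (gaspi_cp_description_t *description,
                             gaspi_timeout_t timeout);

// collectively complete a previously started checkpoint
gaspi_return_t gpi_cp_commit (gaspi_cp_description_t *description,
                              gaspi_timeout_t timeout);

// tear down the checkpoint infrastructure
gaspi_return_t gpi_cp_finalize (gaspi_cp_description_t *description,
                                gaspi_timeout_t timeout);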

Application view

The application decides when it is reasonable to perform a checkpoint. It must also detect faults and enter a recovery process. In this recovery process, a spare node takes the place of the failed one, the last checkpoint is read and the application can continue its execution. Note that the application is assumed to start with a set of spare nodes, i.e. nodes that are initially idle while the remaining active nodes execute the application. Our envisioned application structure using the GASPI in-memory checkpointing interface is the following:

#include <GASPI.h>

// SUCCESS_OR_DIE: the usual convenience macro that aborts if a GASPI call
// does not return GASPI_SUCCESS.

int main()
{
    SUCCESS_OR_DIE (gaspi_proc_init (...));
    gaspi_size_t const size = ...;
    SUCCESS_OR_DIE (gaspi_segment_create (segment_id_checkpoint, size, ...));

    gaspi_cp_description_t checkpoint_description = GASPI_CP_DESCRIPTION_INITIALIZER();
    SUCCESS_OR_DIE (gpi_cp_init (segment_id_checkpoint, (gaspi_offset_t) 0,
                                 size, (gaspi_queue_id_t) 4, cp_policy,
                                 active_group, &checkpoint_description, timeout));
    for (iteration)
    {
        if (checkpoint_this_iteration)
        {
            // complete the previously started checkpoint (global operation)
            SUCCESS_OR_DIE (gpi_cp_commit (&checkpoint_description, timeout));
            // setup segment with id segment_id_checkpoint -> application specific
            // store the to-be-checkpointed data in segment_id_checkpoint
            // e.g.
            // memcpy (ptr (segment_id_checkpoint), ptr (segment_id_work), size - 8);
            // memcpy (ptr (segment_id_checkpoint) + size - 8, &state, 8);
            // start the asynchronous transfer of the new checkpoint to the mirror
            SUCCESS_OR_DIE (gpi_cp_start (&checkpoint_description, timeout));
        }
        ...
    }
    gpi_cp_finalize (&checkpoint_description, timeout);
    SUCCESS_OR_DIE (gaspi_proc_term (GASPI_BLOCK));
}

Initialization Phase

The initialization of the checkpoint infrastructure is done by invoking gpi_cp_init. Currently, and following the GASPI semantics, the application must provide a segment, offset and size where the data to be saved will be placed by the application. This is application specific. Moreover, a checkpoint policy and a group must be given. The group is a GASPI construct and corresponds to the group of ranks that will be active and will perform checkpoints. The checkpoint policy corresponds to the selection of the neighbor where the mirrored data will be placed. One example is a policy that corresponds to a ring topology in one direction, where the mirror is always located to the right or to the left of a node. Other topologies are also possible and the checkpoint policy object can be chosen by the application. We have also foreseen that the checkpoint object could be created and provided by the user, giving maximum flexibility.
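To make the ring policy concrete, the helper below shows the kind of mapping such a policy implies. The mirror selection itself is done inside the library; this sketch is purely illustrative.

#include <GASPI.h>

// Illustration of a one-directional ring policy: rank r mirrors its
// checkpoint data to rank (r + 1) modulo the number of active ranks,
// i.e. every active rank keeps the checkpoint of its left neighbor.
gaspi_rank_t ring_mirror_of (gaspi_rank_t const rank,
                             gaspi_rank_t const num_active)
{
    return (gaspi_rank_t) ((rank + 1) % num_active);
}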

After initialization, a checkpoint description is returned. This checkpoint description is then used to invoke the other routines. One important consequence of this initialization design is that several snapshots are possible, simply by invoking the initialization multiple times. This way, an application can have different checkpoints with different policies, priorities and redundancy or persistence levels.
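For example, an application could keep a frequent small checkpoint in a DRAM segment next to a less frequent full checkpoint in an NVRAM-backed segment. The sketch below assumes two already created segments; the segment ids, queue ids, sizes, policy, group and timeout are placeholders in the style of the example above.

// two independent checkpoint descriptions over different target segments
gaspi_cp_description_t cp_small = GASPI_CP_DESCRIPTION_INITIALIZER();
gaspi_cp_description_t cp_full  = GASPI_CP_DESCRIPTION_INITIALIZER();

// frequent, small snapshot kept in a DRAM segment
SUCCESS_OR_DIE (gpi_cp_init (segment_id_dram, (gaspi_offset_t) 0, small_size,
                             (gaspi_queue_id_t) 4, cp_policy,
                             active_group, &cp_small, timeout));

// less frequent, full snapshot kept in an NVRAM-backed segment
SUCCESS_OR_DIE (gpi_cp_init (segment_id_nvram, (gaspi_offset_t) 0, full_size,
                             (gaspi_queue_id_t) 5, cp_policy,
                             active_group, &cp_full, timeout));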

Checkpointing

Our in-memory checkpoint approach follows what we call an asynchronous, coordinated checkpointing approach. It is coordinated because at some point in time we ensure global consistency of a snapshot by means of a collective operation (gaspi_barrier). In other words, all active processes ensure that they have one particular snapshot that is consistent on all processes. It is asynchronous because we take advantage of GASPI/GPI communication: when a checkpoint is performed, the data is transferred to the mirror using asynchronous communication.

Performing a checkpoint is a two-step procedure: the application must a) start a checkpoint and b) commit the checkpoint. Again, this split-phase semantics matches that of GASPI communication and aims at hiding the cost of the communication required by mirroring. Starting a checkpoint (gpi_cp_start) initiates the copy of the checkpoint data to the neighboring mirror. All the details required to post that communication are included in the checkpoint description object.

Committing the checkpoint (gpi_cp_commit) is a global operation and ensures the completion of a previously started checkpoint operation on all nodes. At this point, a valid snapshot exists to which the application can return. Being a global operation and following the GASPI semantics, the commit operation has a timeout to avoid blocking. Moreover, the timeout parameter also allows deciding whether to do a new checkpoint: it can be used as a test flag to check whether the previously initiated commit has finished and, based on that, to decide whether to start a new one.
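A minimal sketch of this usage inside the iteration loop follows, reusing the checkpoint_description and timeout from the example above and assuming gpi_cp_commit follows the usual GASPI timeout semantics, i.e. GASPI_TEST returns immediately and GASPI_TIMEOUT signals that the previous checkpoint is not yet complete.

// test whether the previously started checkpoint has completed globally
gaspi_return_t const ret = gpi_cp_commit (&checkpoint_description, GASPI_TEST);

if (ret == GASPI_SUCCESS)
{
    // a consistent snapshot exists: refresh the data in the checkpoint
    // segment and start the next asynchronous transfer to the mirror
    SUCCESS_OR_DIE (gpi_cp_start (&checkpoint_description, timeout));
}
else if (ret == GASPI_TIMEOUT)
{
    // previous checkpoint still in flight: skip checkpointing this iteration
}
else
{
    // GASPI_ERROR: enter fault detection and recovery (see below)
}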

Fault Detection

The detection of faults is orthogonal to checkpointing and currently has to be programmed by the application. GASPI already provides mechanisms for that: timeouts and the error state vector. In the current GPI implementation, the hardware fault of a node can be detected locally by a process requesting communication with the faulty node. If a communication request returns an error or a timeout, the process can check the error state vector. The error state vector is set after every non-local operation and can be used to detect failures of remote processes. Each rank can have the state GASPI_STATE_HEALTHY or GASPI_STATE_CORRUPT. This error state vector can be queried by the application (using gaspi_state_vec_get) to determine the state of a remote partner in case of a timeout or error.
If a problem is detected, the fault needs to be communicated to and acknowledged by all other running processes. After fault detection, all remaining healthy processes can consistently enter the recovery process.
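A minimal sketch of such a check, using the GPI-2 calls gaspi_proc_num and gaspi_state_vec_get, could look as follows.

#include <GASPI.h>
#include <stdlib.h>

// After a communication call returned GASPI_ERROR or GASPI_TIMEOUT, query the
// error state vector and return the first rank marked as corrupt (or the
// total number of ranks if all partners are healthy).
gaspi_rank_t find_corrupt_rank (void)
{
    gaspi_rank_t num_ranks;
    SUCCESS_OR_DIE (gaspi_proc_num (&num_ranks));

    gaspi_state_vector_t state_vector = malloc (num_ranks);
    SUCCESS_OR_DIE (gaspi_state_vec_get (state_vector));

    gaspi_rank_t corrupt = num_ranks;
    for (gaspi_rank_t r = 0; r < num_ranks; ++r)
    {
        if (state_vector[r] == GASPI_STATE_CORRUPT)
        {
            corrupt = r;
            break;
        }
    }

    free (state_vector);
    return corrupt;
}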

Recovery

Once a fault is detected, the remaining processes must enter a recovery step. This recovery step generally involves three actions:

  1. Bring-up of spare node(s) to take the place of the failed one(s).
  2. Creation of the new group of active processes.
  3. Restore data from consistent checkpoint.

The first two actions must be programmed in the application, although in principle it should be possible to perform them in a more automatic way. The third action corresponds to the checkpoint_restore call of our in-memory checkpointing interface.
The checkpoint_restore call is symmetric to the initialization call. The differences are that a new group of active processes is provided and that the checkpoint description object is updated on the set of surviving nodes and created anew for the newly joining (spare) nodes. Moreover, a valid snapshot is retrieved from the corresponding mirrors and, when the procedure returns successfully, the data is available in the provided memory segment. From this point, the application can continue.
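A sketch of the first two recovery actions, using the GPI-2 group API, could look as follows. It assumes that ranks 0 .. num_active-1 were active so far, that spare_rank was idle, and that all surviving processes have agreed on the failed and the spare rank during fault detection; the exact restore call and its signature should be taken from the interface on GitHub.

#include <GASPI.h>

// Build the new group of active ranks in which a spare rank replaces the
// failed one. All members, survivors and the joining spare, call this.
void rebuild_active_group (gaspi_rank_t const num_active,
                           gaspi_rank_t const failed_rank,
                           gaspi_rank_t const spare_rank,
                           gaspi_group_t *new_active_group)
{
    SUCCESS_OR_DIE (gaspi_group_create (new_active_group));

    for (gaspi_rank_t r = 0; r < num_active; ++r)
    {
        // the spare rank takes the slot of the failed rank
        SUCCESS_OR_DIE (gaspi_group_add (*new_active_group,
                                         r == failed_rank ? spare_rank : r));
    }

    SUCCESS_OR_DIE (gaspi_group_commit (*new_active_group, GASPI_BLOCK));

    // action 3: pass the new group to the restore call of the checkpointing
    // interface; on success the last consistent snapshot is available in the
    // provided segment and the application continues from there
}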