Fault Tolerance

Melissa includes a robust system for fault tolerance and checkpointing the various components of an experiment. Users can activate fault tolerance by setting:

{
    "launcher_config": {
        "fault_tolerance": true
    }
}
in their config file. When fault_tolerance is activated, the launcher restarts individual clients whenever a client is killed for any reason, including when:

  • the time between two consecutive messages from a single client exceeds study_options.simulation_timeout seconds
  • a client fails due to scheduling problems on the supercomputer
  • a client hits a wall-time limit and is killed by the scheduler

Meanwhile, the launcher will also detect if the server has failed for any reason and will relaunch it; a sketch of this timeout bookkeeping follows the list below. The server is relaunched when:

  • the server hasn't sent any messages to the launcher for launcher_config.server_timeout seconds
  • the server fails due to scheduling problems on the supercomputer
  • the server hits a wall-time limit and is killed by the scheduler
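
Both timeout checks amount to the launcher tracking when it last heard from each component and comparing that against the configured limits. The sketch below illustrates only that bookkeeping; the function and variable names are illustrative rather than Melissa's internals, and the scheduler- and wall-time-related failures above are detected through the batch scheduler instead.

import time

# Illustrative values; in a real study these come from
# study_options.simulation_timeout and launcher_config.server_timeout.
SIMULATION_TIMEOUT = 400.0   # max seconds between two messages from a client
SERVER_TIMEOUT = 100.0       # max seconds since the last server message

def find_timed_out(last_client_msg, last_server_msg):
    """Return the clients whose last message is too old, and whether the server timed out."""
    now = time.monotonic()
    stale_clients = [cid for cid, t in last_client_msg.items()
                     if now - t > SIMULATION_TIMEOUT]
    server_stale = (now - last_server_msg) > SERVER_TIMEOUT
    return stale_clients, server_stale

The launcher would then resubmit every client flagged as stale and relaunch the server if it has gone quiet.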

Checkpointing the server

The server contains important state information that needs to be checkpointed for a seamless restart after failure. The state information includes:

  • the lists of completed and unfinished clients
  • statistics fields (for Sensitivity Analyses)
  • the buffer (reservoir) object (for Deep Learning studies)
  • neural network weights/biases, optimizer state, and batch progress (for Deep Learning studies)
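
As an illustration only, this state can be thought of as one serialized bundle. The helper below is a hypothetical sketch, not Melissa's on-disk format or API; the actual serialization is handled internally by self.checkpoint().

import pickle

# Hypothetical sketch of bundling the server state listed above.
# Field names and the use of pickle are illustrative only.
def save_checkpoint(path, finished_clients, unfinished_clients,
                    statistics=None, buffer=None,
                    model_state=None, optimizer_state=None, batch=0):
    state = {
        "finished_clients": finished_clients,      # completed clients
        "unfinished_clients": unfinished_clients,  # clients still to run
        "statistics": statistics,                  # SA statistics fields
        "buffer": buffer,                          # DL reservoir buffer
        "model_state": model_state,                # network weights/biases
        "optimizer_state": optimizer_state,        # optimizer state
        "batch": batch,                            # batch progress
    }
    with open(path, "wb") as f:
        pickle.dump(state, f)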

Checkpointing is performed by calling self.checkpoint() inside the server. For sensitivity analyses, this happens automatically at an interval given by the checkpoint_interval setting:

{
    "sa_config": {
        "checkpoint_interval": 100
    }
}

This indicates that the server will checkpoint itself after every 100 samples received from the clients. For deep learning experiments, on the other hand, checkpointing is performed by the user calling self.checkpoint(batch_number) in their custom server. For example, in examples/heat-pde/heat-pde-dl/heatpde_dl_server.py, the server is checkpointed right after validation, which means it gets checkpointed every n_batches_update batches. The user simply calls:

self.checkpoint(batch_number)

where batch_number is the current batch number.
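
The overall pattern looks roughly like the sketch below. Only the call to self.checkpoint(batch_number) mirrors what is described above; the stub base class, train_batch(), validate(), and the n_batches_update attribute are placeholders standing in for the user's own server code and Melissa's actual base class.

class _StubServer:
    """Stand-in for Melissa's deep-learning server base class (illustrative)."""
    n_batches_update = 100          # validate (and checkpoint) every 100 batches

    def checkpoint(self, batch_number):
        print(f"checkpointing at batch {batch_number}")


class MyDLServer(_StubServer):
    def train(self, batches):
        for batch_number, batch in enumerate(batches):
            self.train_batch(batch)                        # one optimization step
            if (batch_number + 1) % self.n_batches_update == 0:
                self.validate()                            # periodic validation
                self.checkpoint(batch_number)              # checkpoint right after validation

    def train_batch(self, batch):
        pass  # forward/backward pass and optimizer step would go here

    def validate(self):
        pass  # validation loop would go here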

Launcher failure

When the launcher fails, the experiment must be restarted manually. In this case, the user can elect to continue from where the experiment left off by activating load_from_checkpoint in their launcher_config:

{
    "launcher_config": {
        "load_from_checkpoint": true
    }
}

The launcher will then ensure that the server is instantiated from the checkpoint information, exactly the same way it would be initialized if the server had failed.
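
To resume after a launcher failure with fault tolerance still enabled, the two options shown in this section can be combined in the same launcher_config, for example:

{
    "launcher_config": {
        "fault_tolerance": true,
        "load_from_checkpoint": true
    }
}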