Fault Tolerance

Melissa provides a fault-tolerance and checkpointing system covering the components of an experiment: the clients, the server, and the launcher. Users can enable fault tolerance by configuring launcher_config as follows:

{
    "launcher_config": {
        "fault_tolerance": true
    }
}

When fault_tolerance is enabled, the launcher automatically restarts any client that is terminated for any reason, including:

  • More than study_options.simulation_timeout seconds elapse between two consecutive messages from a client (which may indicate a failure).
  • The client encounters scheduling issues on the supercomputer.
  • The client reaches its allocated wall time and is terminated by the scheduler.

Additionally, the launcher continuously monitors the server's status and will automatically relaunch it if a failure is detected. The server may be restarted if:

  • No messages are received from the server for launcher_config.server_timeout seconds.
  • The server encounters scheduling issues on the supercomputer.
  • The server reaches its allocated wall time and is terminated by the scheduler.
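
In both cases the restart logic follows the same watchdog pattern: the launcher tracks the time since each job's last message and resubmits the job when it stays silent for too long or fails in the scheduler. The sketch below only illustrates this pattern; the Job class and its methods are illustrative placeholders, not Melissa's internal API.

import time
from dataclasses import dataclass, field


@dataclass
class Job:
    """Illustrative stand-in for a launcher-managed job (client or server)."""
    name: str
    is_server: bool
    last_message: float = field(default_factory=time.monotonic)
    scheduler_state: str = "RUNNING"

    def restart(self) -> None:
        print(f"resubmitting {self.name}")
        self.last_message = time.monotonic()


def watchdog(jobs, simulation_timeout: float, server_timeout: float) -> None:
    """Resubmit any job that went silent for too long or failed in the scheduler."""
    now = time.monotonic()
    for job in jobs:
        timeout = server_timeout if job.is_server else simulation_timeout
        silent = now - job.last_message > timeout
        failed = job.scheduler_state in ("FAILED", "TIMEOUT")
        if silent or failed:
            job.restart()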

Checkpointing the Server

The server maintains critical state information that must be checkpointed to ensure a seamless restart after failure. This state information includes:

  • Completed and unfinished clients.
  • Statistical fields (for Sensitivity Analysis).
  • Buffer (reservoir) object (for Deep Learning studies).
  • Neural Network weights, biases, optimizer state, and batch progress (for Deep Learning studies).
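
Taking a Deep Learning study as an example, a checkpoint therefore bundles the training state together with the bookkeeping about clients and the buffer. The sketch below shows one way such a bundle could be written to disk; the field names, the file layout, and the use of torch.save are assumptions for illustration, not Melissa's actual checkpoint format.

import torch


def save_checkpoint(path, model, optimizer, batch_idx, client_status, buffer_state):
    """Write the server state listed above to a single checkpoint file."""
    torch.save(
        {
            "model_state": model.state_dict(),          # network weights and biases
            "optimizer_state": optimizer.state_dict(),  # e.g. Adam moments
            "batch_idx": batch_idx,                     # batch progress
            "client_status": client_status,             # completed and unfinished clients
            "buffer_state": buffer_state,               # buffer (reservoir) contents
        },
        path,
    )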

Checkpointing is triggered by calling self.checkpoint() within the server. For Sensitivity Analysis, this process occurs at a user-defined interval, configured as follows:

{
    "sa_config": {
        "checkpoint_interval": 100
    }
}

This means the server will create a checkpoint after every 100 samples received from clients.
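
In other words, the server counts the samples it receives and triggers a checkpoint whenever the counter reaches a multiple of checkpoint_interval. Below is a minimal sketch of this trigger; handle_sample stands in for whatever the server does with incoming data and is not a Melissa method.

class SketchSAServer:
    """Illustrative counter-based checkpoint trigger, not Melissa's server class."""

    def __init__(self, checkpoint_interval: int = 100):
        self.checkpoint_interval = checkpoint_interval
        self.samples_received = 0

    def checkpoint(self) -> None:
        print(f"checkpoint after {self.samples_received} samples")

    def handle_sample(self, sample) -> None:
        self.samples_received += 1
        # ... update the statistical fields with `sample` ...
        if self.samples_received % self.checkpoint_interval == 0:
            self.checkpoint()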

For Deep Learning studies, the checkpointing approach is similar: checkpoint_interval specifies the number of processed batches between checkpoints. Users can configure this setting as follows:

{
    "dl_config": {
        "checkpoint_interval": 100
    }
}

By default, checkpoint_interval is set to the value of nb_batches_update. For greater flexibility, users can manually trigger a checkpoint at any batch index by calling self.checkpoint(batch_idx) on their custom servers.
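
As an illustration of the manual trigger, a custom server can simply call self.checkpoint(batch_idx) from its own training loop. In the sketch below, only checkpoint(batch_idx) comes from the text above; the class name, training_loop, and the checkpoint condition are hypothetical.

class MyCustomServer:
    """Sketch of a user-defined server; in practice it would derive from Melissa's DL server base class."""

    def checkpoint(self, batch_idx: int) -> None:
        print(f"checkpoint at batch {batch_idx}")

    def training_loop(self, batches) -> None:
        for batch_idx, batch in enumerate(batches):
            # ... forward/backward pass and optimizer step on `batch` ...
            if batch_idx > 0 and batch_idx % 500 == 0:  # any condition the user prefers
                self.checkpoint(batch_idx)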

Launcher Failure

If the launcher itself fails, the study must be restarted manually. However, users can resume from the last checkpoint by enabling load_from_checkpoint in launcher_config:

{
    "launcher_config": {
        "load_from_checkpoint": true
    }
}

With load_from_checkpoint enabled, the launcher ensures that the server is restored from the last checkpoint, just as it would be if the server had failed and restarted automatically. This guarantees continuity, allowing the experiment to resume without loss of progress.
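
Conceptually, the restore step on server start-up mirrors the save sketch above: if load_from_checkpoint is set and a checkpoint file exists, the saved state is loaded and the study resumes from where it stopped. The file name and the use of torch.load below are assumptions for illustration, not Melissa's actual restart code.

import os

import torch


def maybe_restore(model, optimizer, path="checkpoint.pt", load_from_checkpoint=True):
    """Return the batch index to resume from (0 when starting fresh)."""
    if not (load_from_checkpoint and os.path.exists(path)):
        return 0
    state = torch.load(path)
    model.load_state_dict(state["model_state"])
    optimizer.load_state_dict(state["optimizer_state"])
    return state["batch_idx"]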