Fault Tolerance
Melissa includes a robust system for fault tolerance and for checkpointing the various components of an experiment. Users can activate fault tolerance by setting `fault_tolerance` in their config file. When `fault_tolerance` is activated, individual clients are restarted by the launcher when/if a client is killed for any reason, including if:
- two subsequent messages from a single client are separated by more than `study_options.simulation_timeout` seconds
- a client fails due to scheduling problems on the supercomputer
- a client hits a wall-time and is killed by the scheduler
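For instance, these options might appear in the config file as follows. This is only a sketch: the placement of `fault_tolerance` under `study_options` and the 400-second timeout value are illustrative assumptions, not taken from the Melissa documentation.

```json
{
  "study_options": {
    "fault_tolerance": true,
    "simulation_timeout": 400
  }
}
```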
Meanwhile, the launcher will also detect if the server has failed for any reason, and will relaunch it. The server may be relaunched after:
- the server hasn't sent any messages to the launcher for `launcher_config.server_timeout` seconds
- the server fails due to scheduling problems on the supercomputer
- the server hits a wall-time and is killed by the scheduler
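The server timeout might be set like this in the launcher configuration. Again a sketch: the 100-second value is arbitrary, and the surrounding file layout is an assumption for illustration.

```json
{
  "launcher_config": {
    "server_timeout": 100
  }
}
```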
Checkpointing the server
The server contains important state information that needs to be checkpointed for a seamless restart after failure. The state information includes:
- the completed and unfinished clients
- statistics fields (for Sensitivity Analyses)
- buffer (reservoir) object (for Deep Learning studies)
- Neural Network weights/biases, optimizer, batch progress (for Deep Learning Studies)
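As a rough illustration of what such a checkpoint contains (this is a hypothetical sketch, not Melissa's actual on-disk format or internals), the state above can be pictured as a single serializable payload:

```python
import pickle

# Hypothetical sketch of a server checkpoint payload; all field names
# here are illustrative, not Melissa's actual checkpoint schema.
def make_checkpoint(finished, unfinished, extra_state):
    return {
        "finished_clients": sorted(finished),      # completed client ids
        "unfinished_clients": sorted(unfinished),  # clients still running
        **extra_state,  # e.g. statistics fields, buffer, network weights
    }

state = make_checkpoint({0, 1}, {2, 3}, {"statistics": {"mean": 0.5}})
blob = pickle.dumps(state)        # in practice this would be written to disk
restored = pickle.loads(blob)     # ...and read back on restart
```

On restart, the server would rebuild its in-memory state from such a payload instead of starting from scratch.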
Checkpointing is performed by calling `self.checkpoint()` inside the server. For sensitivity analyses, this occurs at an interval set in the configuration: an interval of 100, for instance, indicates that the server will checkpoint itself after every 100 samples received from the clients. Meanwhile, for deep learning experiments, checkpointing is performed by the user calling `self.checkpoint(batch_number)` in their custom server. For example, in `examples/heat-pde/heat-pde-dl/heatpde_dl_server.py`, the server is checkpointed right after validation, which means the server gets checkpointed every `n_batches_update` batches. The user simply calls `self.checkpoint(batch)`, where `batch` is the current batch number.
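The checkpoint-after-validation pattern can be sketched as follows. Everything here except the `checkpoint(batch)` call is hypothetical scaffolding, not Melissa's API:

```python
# Minimal sketch of a training loop that checkpoints right after each
# validation pass; class and method names other than checkpoint(batch)
# are illustrative, not Melissa's actual server interface.
class SketchDLServer:
    def __init__(self, n_batches_update):
        self.n_batches_update = n_batches_update  # validate every n batches
        self.checkpointed_at = []                 # batch numbers checkpointed

    def checkpoint(self, batch):
        # Stand-in for Melissa's self.checkpoint(batch_number): persist
        # model weights/biases, optimizer state, and batch progress.
        self.checkpointed_at.append(batch)

    def train(self, n_batches):
        for batch in range(1, n_batches + 1):
            # ... draw a batch from the buffer and take a training step ...
            if batch % self.n_batches_update == 0:
                # ... run validation ...
                self.checkpoint(batch)  # checkpoint right after validation
```

With `n_batches_update = 3` and 10 training batches, the server would checkpoint at batches 3, 6, and 9.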
Launcher failure
When the launcher fails, the experiment must be restarted manually. In this case, the user can elect to continue the study where it left off by activating `load_from_checkpoint` in their `launcher_config`.
The launcher will then ensure that the server is instantiated from the checkpoint information, exactly the same way it would be initialized if the server had failed.
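For instance, the option might be enabled like this (a sketch: only the `load_from_checkpoint` key comes from the text above; the surrounding file layout is an assumption):

```json
{
  "launcher_config": {
    "load_from_checkpoint": true
  }
}
```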