Melissa Server

The Melissa server plays a managerial role for the Melissa ecosystem. The server:

  • initializes and oversees communications between the server and the launcher, and between the clients and the server
  • handles all data received from the clients through the Melissa API
  • directs the launcher to schedule new client instances
  • detects faults and re-launches clients if necessary

The base server

As shown below, the BaseServer is the principal parent object in the class inheritance structure. BaseServer contains all the methods required for initializing the API communications with the clients, connecting to the launcher, and directing the launcher to schedule client jobs. The direct children of BaseServer specialize the data collection, storage, and analysis. For example, the DeepMelissaServer child relies on round-robin communications and ensures that all client data is collected in a "buffer" used for training, while the SensitivityAnalysisServer uses NxM communications to perform a statistical analysis of field data in an "on-line" fashion.

[Class diagram: BaseServer and its specialized children]

Understanding the Fault-Tolerance class

As an elastic and fault-tolerant framework, Melissa contains several mechanisms at the launcher and server levels that make it capable of reacting to client job failures.

From the server perspective, the FaultTolerance class is instantiated as the ft attribute of BaseServer. This class contains the methods necessary to react to any sort of client job failure. Hence, when the launcher detects that a client job has failed, the server is informed through the launcherfd socket and can restart the corresponding group with the same or with new inputs.

In the meantime, the server uses an internal timerfd socket to trigger the monitoring of client job timeouts. This event occurs every simulation_timeout seconds and makes it possible to detect all timed-out clients, i.e. all non-finished clients from which the server has not received a message since the last check.
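
The timeout check amounts to comparing, for every running client, the time of its last message against simulation_timeout. Below is a minimal sketch of this idea; the names are illustrative and do not correspond to the actual FaultTolerance API:

import time
from typing import Dict, List, Set

def detect_timeouts(
    last_message: Dict[int, float],   # client id -> time of last received message
    finished: Set[int],               # clients that already reported completion
    simulation_timeout: float,
) -> List[int]:
    """Return the ids of non-finished clients considered timed out."""
    now = time.monotonic()
    return [
        cid for cid, last in last_message.items()
        if cid not in finished and now - last > simulation_timeout
    ]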

If the failures are non-deterministic, restarting the corresponding clients should be enough. If they are deterministic, there are two possibilities. Either the study was wrongly configured, in which case the failure of all clients will cause the study to be aborted, or the failure is due to a sampling of inputs for which the data generator is not able to converge. In this case, after crashes_before_redraw failures, the clients are restarted with new randomly sampled inputs. With this implementation, any problem occurring at the client level is detected and handled robustly.
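
As a rough illustration of this restart policy, here is a minimal sketch assuming a per-group crash counter. ClientGroup, handle_failure, and draw are hypothetical names, not Melissa's actual FaultTolerance interface:

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ClientGroup:
    group_id: int
    inputs: List[float]
    crash_count: int = 0

def handle_failure(
    group: ClientGroup,
    crashes_before_redraw: int,
    draw: Callable[[], List[float]],
) -> List[float]:
    """Return the inputs with which the failed group should be restarted."""
    group.crash_count += 1
    if group.crash_count > crashes_before_redraw:
        # repeated (likely deterministic) failures: redraw the inputs
        group.inputs = draw()
        group.crash_count = 0
    # otherwise, assume a transient failure and restart with the same inputs
    return group.inputs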

Note

This mechanism can be turned off by setting fault_tolerance to false. In this case, failures will still be detected but they will cause the study to stop immediately without restarting any Melissa instance.

Note

The fault_tolerance option is closely linked to the server checkpointing mechanism, which is highlighted here.

Understanding the different communication protocols

In essence, a Melissa server is designed to collect data from instrumented clients. This instrumentation enables them to send their data through ZeroMQ sockets. However, since the clients are parallelized in a different comm_world from that of the server, this data transfer is not straightforward. Because the sensitivity-analysis and deep-learning servers process data differently, two protocols are considered:

  • Round-robin - each simulation gathers data on its rank 0. Then, for each timestep, the targeted server rank is incremented by one until the total number of server ranks is reached, at which point the mapping starts over from server rank 0. This protocol is particularly suited to the deep-learning server because deep surrogates are usually designed to predict the solution across the entire computational domain at each timestep. Hence, for each sample, the full solution needs to be known by the server (see the sketch after this list).

Note

When one timestep is sent to one rank, all other ranks receive a message with an empty data array that is automatically discarded by the server.

  • NxM - this communication protocol refers to the case where the client and server are respectively parallelized over N and M ranks. It ensures that the data are shared equitably over all server ranks. This protocol is particularly suited to the sensitivity-analysis server because statistics are computed independently on every mesh element. Hence, for each sample, the full solution does not need to be known by the server.
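
The two mappings can be sketched as follows. This is only an illustration of the idea with hypothetical helper names, not Melissa's actual routing code:

import numpy as np

def round_robin_target(time_step: int, num_server_ranks: int) -> int:
    """Round-robin: the full field of a given timestep goes to a single server rank."""
    return time_step % num_server_ranks

def nxm_partition(local_field: np.ndarray, num_server_ranks: int) -> list:
    """NxM: each client rank splits its local field into one chunk per server rank."""
    return np.array_split(local_field, num_server_ranks)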

These protocols correspond to different conditions in the Melissa API. The specific behavior is triggered by the sobol_op and learning attributes of the considered server:

Server                                 learning   sobol_op   Protocol
deep-learning                          2          0          Round-robin
sensitivity-analysis (without Sobol)   0          0          NxM
sensitivity-analysis (with Sobol)      0          1          NxM

The specific case where Sobol indices are computed (i.e. sobol_op=1) is discussed below.

The sensitivity-analysis server

In this case, users inherit from SensitivityAnalysisServer and only need to implement the __init__ method to build the parameter generator of their choosing. An example is available at examples/heat-pde/heat-pde-sa/heatpde_sa_server.py, where the server is inherited as follows:

class HeatPDEServerSA(SensitivityAnalysisServer):
    def __init__(self, config: Dict[str, Any]):
        """
        Users select their parameter generator here
        """

In addition, the sensitivity-analysis server relies on the compute_stats function to compute the requested statistics. These statistics are now obtained with the Python iterative-stats library, which implements several iterative statistics.
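
To give an idea of what computing statistics iteratively means, here is a minimal Welford-style sketch for per-element mean and variance. It only illustrates the principle and is not the iterative-stats API:

import numpy as np

class IterativeMoments:
    """Iterative (Welford-style) mean and variance over per-element field values."""

    def __init__(self, field_size: int):
        self.count = 0
        self.mean = np.zeros(field_size)
        self.m2 = np.zeros(field_size)

    def update(self, sample: np.ndarray) -> None:
        # update the running moments with one new field sample
        self.count += 1
        delta = sample - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (sample - self.mean)

    def variance(self) -> np.ndarray:
        return self.m2 / max(self.count - 1, 1)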

Understanding the Sobol indices computation and the pick-freeze method

The Sobol indices computation relies on the pick-freeze method, whose principle is to provide the 2 independent and nb_parameters correlated samples required by the Martinez formula. This method is described in detail in section 3 of Terraz et al.

Melissa handles this procedure by building groups of size nb_parameters+2 whose inputs are sampled according to the pick-freeze method. To do so, two input lists are first obtained by calling the draw function of the sample generator twice. The remaining nb_parameters sets of inputs are then obtained by combining the first two.
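
A minimal sketch of this group construction, assuming the common pick-freeze convention where the i-th mixed set takes all parameters from the first draw except the i-th, which is picked from the second (the function name is illustrative):

import numpy as np

def pick_freeze_group(draw, nb_parameters: int) -> np.ndarray:
    """Build the nb_parameters + 2 input sets of one Sobol group."""
    a = np.asarray(draw())   # first independent sample of size nb_parameters
    b = np.asarray(draw())   # second independent sample
    group = [a, b]
    for i in range(nb_parameters):
        mixed = a.copy()
        mixed[i] = b[i]      # "pick" the i-th parameter from b, "freeze" the others
        group.append(mixed)
    return np.stack(group)   # shape: (nb_parameters + 2, nb_parameters)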

Finally, since the data needs to be processed at the group level, each sample is first gathered on simulation 0 of the group before being sent to the server. As explained earlier, this behavior is automatically triggered by hitting a specific condition of the API when sobol_op=1 and learning=0. Consequently, the solutions from the whole group do not need to be assembled on the server side and can directly be passed to the compute_stats function as an array of size group_size*data_size.

Warning

The pick-freeze method assumes that the sampling function (in this case draw) provides independent, uniformly distributed random variables and is therefore not compatible with arbitrary designs of experiments.

The deep-learning server

Users inherit from one of the two mid-level classes to add their study-specific methods. A variety of examples are available, including examples/lorenz/lorenz_server.py and examples/heat-pde/heat-pde-dl/heatpde_server.py, which contain custom user-created classes. In lorenz_server.py, users will find a class called LorenzServer geared toward training a surrogate model to learn the Lorenz system. LorenzServer inherits from TorchServer as follows:

class LorenzServer(TorchServer):
    def __init__(self, config: Dict[str, Any]):
        """
        Users select their parameter generator here
        """

    def train(self, model: torch.nn.Module):
        """
        Users define their train loop here
        """

    def set_model(self):
        """
        Users define their neural architecture here
        """

    def configure_data_collection(self):
        """
        Users define a buffer type here
        """

As shown in the class diagram above, the user defined server class can also override any base-level functionality necessary. This permits advanced users to customize various pieces of the server, all from this single file.

Distributed data parallelism with torch and tensorflow

Deep surrogate training with Melissa takes advantage of what is often referred to as data parallelism. Each server process has its own reservoir (see next section) from which batches are extracted and fed to the GPU associated with that process. In the meantime, each GPU has its own copy of the network architecture and all gradient computations are aggregated together at every backpropagation step. Such distributed training is made possible thanks to torch DDP and tensorflow distributed training features, which take care of the inter-GPU communications. To handle such communications, special attention must be paid when setting up the environment. In Melissa, this is done with the setup_environment methods (see here and there, respectively for torch and tensorflow), which were designed from the IDRIS guidelines.
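
For readers unfamiliar with torch DDP, here is a minimal, generic setup sketch. It is not Melissa's setup_environment implementation; in particular, Melissa derives ranks and addresses from the Slurm environment rather than from the RANK/WORLD_SIZE/LOCAL_RANK variables assumed below:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(model: torch.nn.Module) -> DDP:
    """Wrap a model for data-parallel training, one process per GPU."""
    # assumes RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and MASTER_PORT are set
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)
    # every process holds a full model replica; gradients are all-reduced at each backward pass
    return DDP(model.cuda(local_rank), device_ids=[local_rank])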

Warning

With torch, tensors and operations can be explicitly associated with a given device with the to method. With tensorflow, however, process/GPU pinning has proved more difficult to achieve with the MultiWorkerMirroredStrategy. When setting up the distributed environment, the SlurmClusterResolver method relies by default on slurm environment variables as well as on nvidia-smi to identify information such as the list of hosts and the number of tasks, nodes, and GPUs available. In order to make sure that each process only sees its associated GPU, the set_visible_devices method is used to make only the GPU corresponding to the process rank visible. This is done automatically when auto_set_gpu=True but may fail if the CUDA environment has been instantiated prior to the SlurmClusterResolver call. With Melissa, the visible device is hence set explicitly in setup_environment as follows:

physical_devices = tf.config.list_physical_devices('GPU')
# pin this server process to one of the GPUs available on the node
local_rank = self.rank % len(physical_devices)
# hide all other GPUs from this process
tf.config.set_visible_devices(physical_devices[local_rank], 'GPU')

Understanding the reservoir

Online training architectures, such as the one present in Melissa, require careful use of a "reservoir." The reservoir, also referred to as a buffer, is the intermediary between the clients (i.e. the data generators) and the dataset iterator. The clients put new samples into the buffer, and the dataset iterator gets samples from it to build batches for the training loop. Some key concerns addressed by the Melissa buffer:

  • Is there enough data in the buffer yet so that the sampling population is diverse enough? Users set the per_server_watermark in the dl_config to control this level. If a sufficient number of samples is not yet available, the dataset iterator will not build batches for training. As soon as the per_server_watermark is reached, the dataset iterator begins selection.
  • How does the dataset iterator sample from the buffer? Melissa offers a variety of sampling methods depending on the user's needs. FIRO is the simplest buffer available: it selects a random sample from the buffer and removes it. This means that each sample is seen exactly once, and the sampling is uniformly random across the entire buffer, which has a maximum size designated by the user in dl_config.buffer_size (a toy sketch is given after this list).
  • What happens when the clients are finished sending data to the server, but the buffer still contains samples? Melissa has some default logic for handling this situation: the dataset will continue to draw samples from the buffer until the buffer is emptied (or the number of batches is satisfied). Users can override this logic by overriding other_processes_finished() in their custom server class.
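
To make the FIRO behavior concrete, here is a toy sketch of a random-pop buffer. It is not Melissa's FIRO implementation (which lives in melissa/server/deep_learning/reservoir.py), only an illustration of the put/get semantics described above:

import random
from typing import Any, List

class RandomPopBuffer:
    """Toy random-pop buffer: samples are popped uniformly at random and seen once."""

    def __init__(self, buffer_size: int, per_server_watermark: int):
        self.buffer_size = buffer_size
        self.per_server_watermark = per_server_watermark
        self._items: List[Any] = []

    def put(self, sample: Any) -> None:
        # real buffers handle the full case (e.g. by blocking); omitted in this sketch
        if len(self._items) < self.buffer_size:
            self._items.append(sample)

    def ready(self) -> bool:
        # batches are only built once the watermark is reached
        return len(self._items) >= self.per_server_watermark

    def get(self) -> Any:
        # uniform random selection; the sample is removed so that it is seen exactly once
        return self._items.pop(random.randrange(len(self._items)))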

The buffer is set inside the user-made server class (e.g. LorenzServer):

class LorenzServer(TorchServer):

    def configure_data_collection(self):
        """
        Users define a buffer type here
        """
        self.buffer = FIRO(self.buffer_size, self.per_server_watermark)
        self.dataset = MelissaIterableDataset(buffer=self.buffer)

If the user wishes to select a different buffer type from melissa/server/deep_learning/reservoir.py, they simply change the name FIRO to their desired buffer.

Validation with a pseudo offline approach

The server has a non-default setting for small-scale prototype validation called pseudo_epochs (inside the dl_config), which changes the behavior of Melissa from online training to pseudo-offline training. The goal of this setting is to give users the ability to use Melissa and the basic FIRO to aggregate all client samples before initiating training (similar to a true offline training). The training loop will then sample from the buffer to create pseudo_epochs worth of batches. It is important to understand that this does not guarantee each point will be seen pseudo_epochs times; instead it means that the total number of batches will be equivalent to (num_samples * num_clients / batch_size) * pseudo_epochs. Points are sampled from the buffer containing the full num_samples * num_clients points, but still with uniform random sampling (i.e. not all points will be seen an equal number of times). By default, pseudo_epochs is set to 1 and does not change the online buffer/training behavior.
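
As a quick numerical illustration of the formula above, with purely hypothetical values (not defaults):

# hypothetical configuration, for illustration only
num_samples, num_clients, batch_size, pseudo_epochs = 100, 10, 32, 4
total_batches = (num_samples * num_clients // batch_size) * pseudo_epochs
print(total_batches)  # (1000 // 32) * 4 = 124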

Understanding the dataset

The dataset is initialized with the user-selected buffer, as shown above. The purpose of the dataset is to yield samples from the buffer back to the user-made training loop. The default dataset logic stops yielding samples once the server is no longer receiving samples from the clients and the buffer does not contain more than per_server_watermark samples.
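
Schematically, this yielding condition can be pictured as follows. It is only a sketch of the logic, not the MelissaIterableDataset implementation; any buffer with ready/get methods (such as the toy buffer above) would fit:

import time
from typing import Any, Callable, Iterator

def sample_stream(buffer, still_receiving: Callable[[], bool]) -> Iterator[Any]:
    """Yield samples while data may still arrive or enough samples remain in the buffer."""
    while still_receiving() or buffer.ready():
        if buffer.ready():
            yield buffer.get()
        else:
            time.sleep(0.1)   # wait for clients to fill the buffer up to the watermark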

Note

The default logic tracks the number of batches seen as well as the number of batches expected, and will empty the buffer by reducing the per_server_watermark to 0 after the server process finishes receiving data. If the number of batches seen is sufficient, the training loop breaks, avoiding any possible deadlocks that may occur across distributed server processes.