Melissa Server¶
The Melissa server is the central coordinator of the Melissa ecosystem. It:
- Establishes and manages communication between the server, launcher, and clients.
- Processes all incoming data from clients via the Melissa API.
- Instructs the launcher to schedule new client instances.
- Detects failures and re-launches clients when needed.

Base Server¶
The base server handles:
- Connections with the Launcher
- Parameter Sampling
- Client Script Generation
- Basic Monitoring with the help of the Launcher
- General Checkpointing of Server Metadata
- Fault Tolerance
As shown in the inheritance hierarchy (sketched below), BaseServer serves as the primary parent class. Each subclass specializes the reception logistics and the main data processing. For instance, DeepMelissaServer utilizes round-robin communication to gather client data into a buffer for training, whereas SensitivityAnalysisServer employs NxM communication to conduct real-time statistical analysis of field data.
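A simplified sketch of that hierarchy (class names are taken from this page; the real classes carry many more responsibilities):

```python
# Simplified view of the server class hierarchy described on this page;
# method bodies and constructor signatures are omitted.
class BaseServer: ...                             # connections, sampling, checkpointing, fault tolerance

class DeepMelissaServer(BaseServer): ...          # round-robin reception, training loop
class TorchServer(DeepMelissaServer): ...         # torch-specific dataset/dataloader
class TfServer(DeepMelissaServer): ...            # TensorFlow-specific dataset/dataloader

class SensitivityAnalysisServer(BaseServer): ...  # NxM reception, iterative statistics
```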
Understanding FaultTolerance Class¶
Melissa’s elastic and fault-tolerant design enables it to handle client job failures effectively through mechanisms at both the launcher and server levels. At the server level, the FaultTolerance class is responsible for managing client failures. It is instantiated as the ft object within BaseServer and provides multiple methods to detect and react to failures.
- Detecting Failures: When the launcher identifies a client job failure, it notifies the server via the `launcherfd` socket. The server can then restart the failed client group with the same or newly generated inputs.
- Monitoring Timeouts: The server uses an internal `timerfd` socket to track client job timeouts. Every `simulation_timeout` seconds, it checks for non-finished clients that haven't sent messages since the last verification.
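Conceptually, the timeout check boils down to something like the following sketch (illustrative only; Melissa's actual implementation integrates the `timerfd` socket into the server's polling loop, and the `Client` fields shown here are hypothetical):

```python
import time
from dataclasses import dataclass

@dataclass
class Client:
    # Hypothetical per-client bookkeeping; field names are illustrative.
    finished: bool
    last_message_time: float

def find_timed_out(clients: list[Client], simulation_timeout: float) -> list[Client]:
    """Return non-finished clients that have been silent for too long."""
    now = time.monotonic()
    return [
        c for c in clients
        if not c.finished and now - c.last_message_time > simulation_timeout
    ]
```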
Handling Different Failure Types¶
- Non-Deterministic Failures: If failures are random, simply restarting the affected clients should be sufficient.
- Deterministic Failures: Incorrect study configurations can cause widespread client failures that result in the study being aborted. Alternatively, if failures stem from input values that prevent convergence, the system employs a retry strategy: after `crashes_before_redraw` failures, clients are restarted with new, randomly sampled inputs.
Note
If `fault_tolerance` is set to `false`, the mechanism is disabled. Failures will still be detected, but instead of restarting failed instances, the study will stop immediately. Additionally, `fault_tolerance` is closely linked to the server's checkpointing mechanism, which is further explained in the Fault Tolerance section.
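The options above typically live in the study configuration; a hypothetical excerpt is sketched below (the option names come from this section, but their exact placement in the configuration file may differ between Melissa versions):

```python
# Hypothetical configuration excerpt; option names come from this section,
# exact nesting may differ between Melissa versions.
study_options = {
    "fault_tolerance": True,     # False aborts the study on the first failure
    "simulation_timeout": 400,   # seconds between timeout verifications
    "crashes_before_redraw": 3,  # failures tolerated before resampling inputs
}
```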
Understanding the Different Communication Protocols¶
A Melissa server is designed to collect data from instrumented clients, which send their data through ZeroMQ sockets. However, since clients operate within a different MPI_COMM_WORLD than the server, data transfer requires a structured approach. The sensitivity-analysis and deep-learning servers process data differently, leading to the use of two distinct communication protocols:
- NxM: In this protocol, the client and server are parallelized over `N` and `M` ranks, respectively, ensuring that data is evenly distributed across all server ranks. This approach is well-suited for the sensitivity-analysis server, as statistics are computed independently for each mesh element, meaning the server does not require the full solution for each sample.
- Round-robin: For multi-rank clients, each completed time step is first gathered on the client's rank 0, as deep-learning workflows typically require the full solution for training. When a time step is assigned to a specific server rank, all other server ranks receive a message with an empty data array, which is automatically discarded. The round-robin protocol supports two modes for deciding the target server rank:
    - Default mode: Time steps from a client are distributed cyclically across all server ranks, each consecutive time step going to the next server rank.
    - Bound trajectory mode: Some deep-learning studies may require an entire trajectory from a single client to be processed by the same server rank and stored in the same buffer. This can be enabled in the configuration, as sketched below.
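A hypothetical `dl_config` excerpt enabling the bound trajectory mode (exact key nesting may differ between Melissa versions; note the buffer constraint in the warning below):

```python
# Hypothetical dl_config excerpt; bound trajectory mode requires the
# Reservoir buffer type (see the warning below).
dl_config = {
    "bind_simulation_to_server_rank": True,
    "buffer": "Reservoir",
}
```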
Warning
- The `bind_simulation_to_server_rank` option is only compatible with the `Reservoir` buffer type. Because this buffer evicts samples on writes, the reservoirs on all server ranks always contain some data to yield batches, even when the number of remaining clients is smaller than the total number of server ranks.
The Melissa API determines which protocol to use based on the `sobol_op` and `learning` attributes of the server:

| Server | `learning` | Protocol |
|---|---|---|
| Deep-learning | 2 | Round-robin |
| Sensitivity-analysis | 0 | NxM |
Sensitivity Analysis Server¶
The SA server handles:
- Reception Logic
- Iterative Statistics Computation
- Checkpointing
Users must inherit from `SensitivityAnalysisServer` and set a parameter-sampling strategy. The Building an SA Server section walks through how to build your own SA server.
Key aspects¶
- The SA server utilizes the `compute_stats` function to calculate the required statistics. These computations are now handled using the Python `iterative-stats` library, which provides efficient implementations of various iterative statistical methods.
- When a client calls `melissa_send` on a field, it transmits partial data instead of the entire time step. This partial data is stored in `PartialSimulationData` on the server side and passed directly to the `compute_stats` function, allowing independent computation without waiting for the full time step.
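To illustrate why partial data suffices, here is a minimal Welford-style sketch of iterative mean and variance over per-element field data. It shows the principle only; it is not the `iterative-stats` library's actual API:

```python
import numpy as np

class IterativeMeanVariance:
    """Welford-style iterative mean/variance, updated one sample at a time."""

    def __init__(self, field_size: int) -> None:
        self.count = 0
        self.mean = np.zeros(field_size)
        self.m2 = np.zeros(field_size)  # running sum of squared deviations

    def update(self, sample: np.ndarray) -> None:
        # Each mesh element is updated independently, so a partial chunk of
        # the field can be processed with its own accumulator.
        self.count += 1
        delta = sample - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (sample - self.mean)

    @property
    def variance(self) -> np.ndarray:
        return self.m2 / max(self.count - 1, 1)
```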
Understanding Sobol Indices Computation and the Pick-Freeze Method¶
The Sobol indices computation utilizes the pick-freeze method, which generates the two independent samples and the `nb_parameters` correlated samples required by the Martinez formula. This method is thoroughly explained in Section 3 of Terraz et al.
In Melissa, the procedure is handled by constructing groups of size `nb_parameters + 2`, with inputs sampled according to the pick-freeze method. First, two input lists are generated by calling the `draw` function of the sample generator twice. Then, the remaining `nb_parameters` sets of inputs are created by combining the first two sets.
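A minimal sketch of this group construction, under one common pick-freeze convention (`draw` stands in for the sample generator; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng()
nb_parameters = 3

def draw() -> np.ndarray:
    # Stand-in for the sample generator: independent uniform inputs.
    return rng.uniform(size=nb_parameters)

# Pick-freeze: two independent input sets...
a, b = draw(), draw()
group = [a, b]
# ...plus nb_parameters combined sets: the i-th combined set takes its i-th
# component from b ("pick") and freezes all other components from a.
for i in range(nb_parameters):
    mixed = a.copy()
    mixed[i] = b[i]
    group.append(mixed)

assert len(group) == nb_parameters + 2  # group size required by the Martinez formula
```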
Warning
The pick-freeze method assumes that the sampling function (in this case `draw`) provides independent, uniformly distributed random variables, and is hence not compatible with any Design of Experiments.
Previous Architecture (Deprecated)¶
Important
The Melissa Client API has undergone a significant architectural change regarding Sobol sensitivity analysis workflows. The previous implementation employed a hierarchical group-based communication pattern that has been deprecated due to performance and maintainability concerns.
The former approach utilized a two-hop communication strategy where clients participating in Sobol studies were organized into groups sharing a common MPI communicator. During each melissa_send operation, all group members would gather their solutions at a designated group leader (rank 0), which subsequently handled all server communication. This design introduced unnecessary communication overhead and complex group management logic.
Current Architecture (Sobol Caching)¶
The current implementation has eliminated group-level client coordination entirely. Each client now operates independently and communicates directly with the server. Group semantics are maintained exclusively on the server side, where client membership is tracked for proper Sobol computation organization.
The server-side has been redesigned to handle Sobol computations through a caching mechanism. The server receives time step data from individual clients and organizes them according to their respective groups. Once all required time steps for a group are collected, the server assembles the complete dataset and forwards it to the compute_stats function for Sobol index calculation.
This architectural revision eliminates client-side coordination complexity while maintaining computational correctness and improving overall system scalability.
Group-Level Job Submission Constraints
Job scheduling operates at the group level rather than individual simulation instances. The job_limit parameter specifies the maximum number of concurrent group jobs, not individual simulations.
Example: With `group_size = 4` and `job_limit = 3` (2 group jobs + 1 server job), all simulations within a group are submitted as a single job unit; therefore 2 groups, i.e., 8 simulations, run concurrently. This design ensures optimal resource utilization by executing all group members simultaneously, thereby ensuring proper cache clearance.
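A hypothetical configuration excerpt matching this example (key names and placement may differ between Melissa versions):

```python
# Hypothetical excerpt: each group of 4 simulations is submitted as one job,
# and job_limit counts the 2 concurrent group jobs plus the server job.
study_options = {
    "group_size": 4,
    "job_limit": 3,  # 2 group jobs + 1 server job => 8 concurrent simulations
}
```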
Deep Learning Server¶
The DL server handles:
- Reception Logic
- Configuration of buffer, dataset, dataloader, etc.
- Framework-agnostic training
- Checkpointing of model, optimizers, etc.
Users must inherit from either `TorchServer` or `TfServer` and set a parameter-sampling strategy. The Building a DL Server section walks through how to build your own DL server.
Key aspects¶
Understanding Buffer/Reservoir¶
Online training architectures, like Melissa's, rely on a carefully managed "reservoir" or buffer. This buffer acts as a bridge between the client (data generator) and the dataset iterator. While clients continuously put new samples into the buffer, the dataset iterator gets samples to form training batches. The Melissa buffer effectively handles several key challenges:
- Is the buffer sufficiently populated for diverse sampling? The `per_server_watermark` parameter in `dl_config` controls this threshold. If the buffer lacks enough samples, the dataset iterator will wait before building training batches. Once the watermark is reached, sampling begins.
- How does the dataset iterator sample from the buffer? Melissa provides three sampling methods:
    - FIFO: A simple queue where the oldest sample is removed on read (`get`).
    - FIRO: Instead of removing the oldest sample, this method randomly selects a sample for eviction (uniform sampling), also on read (`get`).
    - Reservoir: Designed to reduce bias and prevent catastrophic forgetting, this buffer evicts samples on write. It retains samples as long as possible, replacing a randomly chosen sample only when full, ensuring diverse repetitions in online training.
- What happens when clients stop sending data, but the buffer still has samples? By default, Melissa continues drawing samples until the buffer is emptied or the required batches are formed. For the `Reservoir` buffer, which retains samples as long as possible, Melissa transitions from eviction on write to eviction on read once all clients finish sending data. This ensures the buffer is fully flushed at the end of training, preventing an infinite training loop.
Users can set the buffer based on the available resources, as sketched below.
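The excerpt below is a hypothetical `dl_config` sketch; the buffer types and watermark option come from this section, but exact key names may differ between Melissa versions:

```python
# Hypothetical dl_config excerpt; buffer types come from this section,
# exact key names may differ between Melissa versions.
dl_config = {
    "buffer": "Reservoir",           # one of FIFO, FIRO, Reservoir
    "buffer_size": 10_000,           # samples retained per server rank
    "per_server_watermark": 1_000,   # minimum fill level before batching starts
}
```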
Understanding Datasets & Dataloaders¶
- The dataset is initialized with the user-selected buffer and is responsible for feeding samples from the buffer to the training loop. It inherits from the `MelissaIterableDataset` class, whose `__iter__` method yields samples from the buffer. By default, the dataset stops yielding samples when the server is no longer receiving data from clients and the buffer holds fewer than `per_server_watermark` samples. `TorchServer` and `TfServer` use specific dataset types which inherit from `MelissaIterableDataset`.
Note
The default logic tracks both the number of batches seen and the expected number of batches. Once the server stops receiving data, it gradually reduces per_server_watermark to zero, ensuring the buffer is emptied. If enough batches have been processed, the training loop terminates, preventing potential deadlocks in distributed server processes.
- The dataloader is selected based on the framework in use:
    - `torch.utils.data.DataLoader` for `TorchServer`
    - `tf.data.Dataset.from_generator` for `TfServer`

If no framework is specified, `DeepMelissaServer` defaults to `GeneralDataLoader`, which handles batch creation.
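As a rough illustration of the iterable-dataset pattern described above, here is a simplified, framework-free sketch (not Melissa's actual `MelissaIterableDataset` implementation):

```python
import queue
from typing import Any, Iterator

class BufferBackedDataset:
    """Simplified sketch of a dataset that streams samples from a buffer."""

    def __init__(self, buffer: "queue.Queue[Any]", watermark: int) -> None:
        self.buffer = buffer
        self.watermark = watermark
        self.receiving = True  # flipped to False once clients stop sending

    def __iter__(self) -> Iterator[Any]:
        while True:
            # Stop when no more data is arriving and the buffer has drained
            # below the watermark.
            if not self.receiving and self.buffer.qsize() < self.watermark:
                return
            try:
                yield self.buffer.get(timeout=1.0)
            except queue.Empty:
                continue
```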
Understanding Training & Validation¶
Regardless of the framework, Melissa follows a standardized training loop where each section is highly customizable through specific training hooks.
def train(self):
    ...
    self._on_train_start()
    for batch_idx, batch in enumerate(self._train_dataloader):
        batch_idx += self.batch_offset
        # Synchronize ranks; stop if any rank has finished training.
        if self.other_processes_finished(batch_idx):
            break
        self._on_batch_start(batch_idx)
        self.training_step(batch, batch_idx)
        self._on_batch_end(batch_idx)
        # Periodic validation and checkpointing every nb_batches_update batches.
        if (
            self._valid_dataloader is not None
            and batch_idx > 0
            and (batch_idx + 1) % self.nb_batches_update == 0
        ):
            self.validation(batch_idx)
            self._checkpoint(batch_idx)
    # end training loop
    self._on_train_end()
    ...
other_processes_finished prevents deadlocks in multi-rank server training by synchronizing all ranks while ensuring each has data available from its buffer. If any rank completes its training, the loop breaks to maintain efficiency and avoid stalls.
Validation is performed on all server ranks every nb_batches_update steps, provided that users explicitly set the self.valid_dataloader attribute in their custom server class. Users are also responsible for aggregating validation metrics, such as reducing losses, which can be implemented in the on_validation_end() hook.
def validation(self, batch_idx: int):
self._on_validation_start(batch_idx)
for v_batch_idx, v_batch in enumerate(self._valid_dataloader):
self.validation_step(v_batch, v_batch_idx, batch_idx)
self._on_validation_end(batch_idx)
Users can override the on_validation_* hooks and validation_step, or modify the entire validation method if necessary.
Important
In the train and validation methods, hooks with _on prefixes (e.g., _on_train_start()) execute predefined code specific to different server classes like TorchServer and TfServer. However, users should override the corresponding methods without the underscore (e.g., on_train_start()). The predefined hooks automatically call these user-defined methods, allowing customization while preserving built-in functionality.
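For instance, a user subclass might override the hooks as sketched below (the import path and hook signatures are illustrative assumptions):

```python
import logging

# Illustrative import path; adjust to your Melissa installation.
from melissa.server.deep_learning import TorchServer  # hypothetical path

class MyServer(TorchServer):
    def on_train_start(self) -> None:
        # Called automatically by the predefined _on_train_start() hook.
        logging.info("Training started.")

    def on_batch_end(self, batch_idx: int) -> None:
        # Called automatically by the predefined _on_batch_end() hook.
        if batch_idx % 100 == 0:
            logging.info("Processed batch %d.", batch_idx)
```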
Distributed Data Parallelism with torch¶
Deep surrogate training in Melissa leverages data parallelism: each server rank has its own reservoir for batch extraction, which feeds the corresponding GPU. Each GPU also holds its own copy of the network architecture, and gradients are aggregated during each back-propagation step. This distributed training setup is supported by torch DDP, which manages inter-GPU communication. Proper environment setup is crucial for efficient operation; in Melissa, this is handled through the `setup_environment` methods, following the IDRIS guidelines, as sketched below.
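A generic sketch of what such a setup typically looks like under SLURM, following the IDRIS recommendations (an illustration, not Melissa's exact `setup_environment` code; the environment variables assume a SLURM cluster):

```python
import os

import torch
import torch.distributed as dist

def setup_ddp_environment() -> int:
    """Generic DDP setup sketch following common SLURM/IDRIS conventions."""
    rank = int(os.environ["SLURM_PROCID"])         # global rank
    world_size = int(os.environ["SLURM_NTASKS"])   # total number of processes
    local_rank = int(os.environ["SLURM_LOCALID"])  # rank within the node

    # All ranks must agree on the rendezvous address; assumed to be exported
    # by the job script (e.g. the hostname of the first node).
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")

    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)  # pin this process to one GPU
    return local_rank
```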
Validation with a pseudo offline approach¶
The server offers a `pseudo_epochs` setting in `dl_config` for small-scale prototype validation, shifting Melissa from online to pseudo-offline training. This allows users to aggregate all client samples before training, mimicking offline learning while using the basic FIRO buffer. During training, the system samples from the buffer to create `pseudo_epochs` worth of batches. However, this does not guarantee each data point is seen exactly `pseudo_epochs` times. Instead, the total batch count follows `(nb_time_steps * nb_clients / batch_size) * pseudo_epochs`, with uniform random sampling from the full dataset. By default, `pseudo_epochs` is set to 1, maintaining standard online training behavior.
Note
- As `per_server_watermark` approaches `buffer_size`, the scenario can potentially replicate true offline training, i.e., produce a number of batches close to `(nb_time_steps * nb_clients / batch_size) * pseudo_epochs`.
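As a quick sanity check of that formula, with illustrative numbers:

```python
# Illustrative numbers only: expected batch count in pseudo-offline mode.
nb_time_steps, nb_clients, batch_size, pseudo_epochs = 100, 10, 50, 2
total_batches = (nb_time_steps * nb_clients // batch_size) * pseudo_epochs
print(total_batches)  # (100 * 10 / 50) * 2 = 40 batches
```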