Melissa

Summary

Melissa is a file-avoiding, fault-tolerant, and elastic framework designed for large-scale sensitivity analysis and large-scale deep surrogate training on supercomputers. Its largest studies have run 80,000 parallel simulations on up to 30,000 cores while avoiding up to 288 TB of intermediate data storage (see [1]).

Melissa architecture

Traditional sensitivity analysis and deep surrogate training involve running multiple simulation instances with different input parameters, storing the results on disk, and later retrieving them to train a neural network or compute required statistics. However, the storage demands can quickly become overwhelming, leading to long read times and inefficient data processing. To mitigate this, researchers often reduce study sizes by running lower-resolution simulations or down-sampling output data in space and time.

How it works

Melissa (as shown in the figure below) overcomes these storage limitations by eliminating intermediate file storage and processing data in transit, which makes large-scale studies tractable:

  • Sensitivity Analysis Server: Melissa uses iterative statistical algorithms and an asynchronous client-server model for data transfer. Instead of storing simulation outputs on disk, it transmits them via NxM communication patterns to a parallelized server. This enables real-time statistical computation without any disk storage, so full-scale studies can produce statistics for every mesh element and every time step. Melissa supports various statistical measures (e.g. mean, variance, skewness, kurtosis, and Sobol indices) and can be extended with new algorithms.
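The key property of iterative statistics is that the server never needs the full ensemble in memory or on disk: each arriving sample updates the running statistics in place. A minimal sketch of such an update (Welford-style one-pass mean and variance, shown here as a stand-in for Melissa's actual iterative kernels, not its real implementation):

```python
# Illustrative only: a one-pass (Welford-style) update for mean and
# variance, the kind of iterative statistic computed in transit.

class IterativeStats:
    """Running mean/variance over samples arriving one at a time."""

    def __init__(self, vect_size):
        self.count = 0
        self.mean = [0.0] * vect_size   # one statistic per mesh element
        self.m2 = [0.0] * vect_size     # sum of squared deviations

    def update(self, sample):
        """Fold one simulation output vector into the statistics."""
        self.count += 1
        for i, x in enumerate(sample):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (x - self.mean[i])

    def variance(self):
        """Unbiased sample variance per mesh element."""
        return [m2 / (self.count - 1) for m2 in self.m2]


# Each update() plays the role of one client message arriving at the
# server; no simulation output ever touches the disk.
stats = IterativeStats(vect_size=3)
for sample in ([1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [2.0, 2.0, 2.0]):
    stats.update(sample)
```

Because only `mean` and `m2` persist between samples, memory scales with the mesh size, not with the number of simulations, which is what lets studies avoid hundreds of terabytes of intermediate files.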

  • Deep Learning Server: Following a similar approach, client simulations send data in a round-robin manner to a parallelized, multithreaded server. The server manages a buffer for training batches, ensuring efficient memory use. Once the buffer reaches a predefined safety watermark, selected samples form training batches for distributed training on GPUs or CPUs. Memory is managed dynamically by selecting and evicting samples based on predefined policies, enabling both online and pseudo-offline training by adjusting the buffer size, watermark, and selection/eviction strategies.
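The buffer logic described above can be pictured with a simplified sketch; the class, FIFO eviction, and random batch selection below are illustrative choices, not Melissa's actual policies or API:

```python
import random
from collections import deque

# Illustrative sketch of a training buffer with a safety watermark and a
# FIFO eviction policy; names and policies are hypothetical.

class TrainingBuffer:
    def __init__(self, capacity, watermark, batch_size, seed=0):
        self.samples = deque()
        self.capacity = capacity      # hard memory limit
        self.watermark = watermark    # min samples before batches form
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def put(self, sample):
        """Receive one sample from a client; evict the oldest if full."""
        if len(self.samples) >= self.capacity:
            self.samples.popleft()    # FIFO eviction policy
        self.samples.append(sample)

    def next_batch(self):
        """Return a training batch once the watermark is reached, else None."""
        if len(self.samples) < self.watermark:
            return None
        return self.rng.sample(list(self.samples), self.batch_size)


buf = TrainingBuffer(capacity=100, watermark=8, batch_size=4)
for i in range(10):                   # simulated round-robin arrivals
    buf.put({"params": i, "field": [float(i)] * 3})
batch = buf.next_batch()              # 4 samples drawn from the buffer
```

Shrinking the buffer and watermark pushes training toward a fully online regime, while a large buffer filled before training begins approximates offline training, which is the online/pseudo-offline trade-off described above.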

Overview of Melissa's deep learning framework

Both sensitivity analysis and deep surrogate training in Melissa depend on three key components:

  1. Melissa Client: This is the parallel numerical simulation code, adapted to function as a client. Each client runs independently and sends mid-simulation output to the server whenever melissa_send() is called.

  2. Melissa Server: A parallel executable responsible for computing statistics or training a Neural Network (more details here). It updates statistics and generates training batches upon receiving new data from any connected client.

  3. Melissa Launcher: A front-end Python script that orchestrates the execution of the study (more details here). It automates large-scale job management, running jobs with OpenMPI or through cluster schedulers such as Slurm and OAR, and handles job submission, monitoring, and fault tolerance.
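The launcher's supervision cycle (submit, monitor, restart on failure) can be sketched as a toy loop; the `Job` class, states, and retry policy below are invented for illustration and do not reflect the launcher's real scheduler interface:

```python
# Toy model of a launcher-style supervision loop: run jobs, check their
# state, and resubmit failures up to a retry limit. In reality the
# launcher drives OpenMPI, Slurm, or OAR instead of this Job stub.

class Job:
    def __init__(self, job_id, fail_times=0):
        self.job_id = job_id
        self.fail_times = fail_times  # how many runs fail before success
        self.runs = 0

    def run(self):
        self.runs += 1
        return "FAILED" if self.runs <= self.fail_times else "COMPLETED"


def supervise(jobs, max_retries=2):
    """Run every job, restarting failed ones up to max_retries times."""
    status = {}
    for job in jobs:
        state = job.run()
        retries = 0
        while state == "FAILED" and retries < max_retries:
            retries += 1              # fault tolerance: resubmit the job
            state = job.run()
        status[job.job_id] = state
    return status


jobs = [Job("sim-0"), Job("sim-1", fail_times=1), Job("sim-2", fail_times=5)]
result = supervise(jobs)
# sim-0 succeeds first try, sim-1 after one restart, sim-2 exhausts retries
```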

User interface

To run an analysis with Melissa, users need to follow these steps:

  1. Instrument the Simulation Code: Modify the simulation to use the Melissa API with three main calls (init, send, and finalize) so it functions as a Melissa client (details here).

  2. Configure the Analysis: Define how simulation parameters are sampled, select statistical computations, or specify the Neural Network architecture and training settings (details here).

  3. Launch the Analysis: Run the Melissa launcher via the terminal or the supercomputer's front-end (quick start guide). Melissa handles resource allocation, execution monitoring, and automatic restarts for failed components.
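Step 1 can be pictured with a minimal Python solver. The `melissa_*` functions below are stubbed stand-ins so the sketch is self-contained; the real API signatures (in C, Fortran, or Python) may differ, so consult Melissa's API documentation:

```python
# Sketch of an instrumented solver. The melissa_* functions are stubbed
# stand-ins so this example runs on its own; they are NOT the real API.

sent = []   # stub transport: records what would go to the server

def melissa_init(field_name, vect_size):
    """Stub: would connect this client to the Melissa server."""
    sent.append(("init", field_name, vect_size))

def melissa_send(field_name, values):
    """Stub: would stream one time step's field to the server."""
    sent.append(("send", field_name, list(values)))

def melissa_finalize():
    """Stub: would close the connection to the server."""
    sent.append(("finalize",))

def run_simulation(n_steps, vect_size):
    """A trivial 'solver' time loop instrumented with the three calls."""
    melissa_init("temperature", vect_size)
    field = [0.0] * vect_size
    for _ in range(n_steps):
        field = [x + 1.0 for x in field]        # pretend physics update
        melissa_send("temperature", field)      # no file is ever written
    melissa_finalize()

run_simulation(n_steps=3, vect_size=4)
```

The pattern to note is that instrumentation is confined to three call sites: one before the time loop, one inside it, and one after, leaving the solver's numerics untouched.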

Melissa’s API currently supports C, Fortran, and Python solvers but can be extended to other languages by following the approach in the API folder.

List of publications


  1. Alejandro Ribés, Théophile Terraz, Yvan Fournier, Bertrand Iooss, and Bruno Raffin. Unlocking large scale uncertainty quantification with in transit iterative statistics. In Hank Childs, Janine C. Bennett, and Christoph Garth, editors, In Situ Visualization for Computational Science, 113–136. Cham, 2022. Springer International Publishing.