Creating a Study Configuration

Creating a configuration file for Melissa can seem confusing due to the numerous options and detailed settings. However, by breaking it down step by step, we can make the process more manageable.

The JSON configuration file serves as the backbone of a Melissa study, defining key aspects such as simulation and parameter setup, server setup, and job scheduling. Each section plays a crucial role in ensuring smooth execution and efficient resource management.

Since the JSON file is loaded as a Python dict by the running server instance, users have the flexibility to modify it dynamically. However, certain options require specific keys to be properly defined for the configuration to function correctly.
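
As an illustration of this flexibility, the sketch below loads a configuration file as a plain dict and overrides one value before use; the file name is only an example.

import json

# Load the study configuration; the file name is illustrative.
with open("my_study_config.json") as f:
    config = json.load(f)

# Because the configuration is a regular Python dict, values can be
# adjusted programmatically, e.g. to scale the parameter sweep.
config["study_options"]["parameters_sweep_size"] = 2000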

Important

We only cover the minimal aspects of the configuration needed to run a simple study. To see all options in detail, run melissa-launcher --print-options.

Top-level JSON dictionary keys of the configuration

In any configuration, some options remain constant, while others vary depending on the study type (SA or DL) and the chosen job scheduler.

server_filename & server_class

Specify the Python script that defines the YourServerClass class, which inherits one of Melissa's predefined server classes based on the type of study.

{
  "server_filename": "your_server_file_name.py",
  "server_class": "YourServerClass"
}
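
As a rough, non-authoritative sketch, your_server_file_name.py could be structured as follows; the import path and base-class name are placeholders for whichever predefined Melissa server class matches your study type.

# your_server_file_name.py -- structural sketch only.
# `PredefinedMelissaServer` is a placeholder: import the Melissa
# server class that matches your study type (SA or DL) instead.
from melissa.server import PredefinedMelissaServer  # placeholder import


class YourServerClass(PredefinedMelissaServer):
    # Override the hooks required by the chosen base class here
    # (e.g. the training loop for DL studies).
    pass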

study_options

Specify the number of simulations to run, the number of input parameters to sample for each simulation, the fields to be received by the server (the same field names that were registered with the melissa_init and melissa_send calls when building the solver), and the number of time steps to be sent, as shown:

{
    "study_options": {
        "parameters_sweep_size": 1000,
        "nb_parameters": 2,
        "field_names": ["field_1", "field_2"],
        "nb_time_steps": 100,
        // Following are optional
        "seed": 123,
        "simulation_timeout": 60,
        "crashes_before_redraw": 1000,
        "verbosity": 2
    }
}

Note

  • seed is recommended and is used by the numpy.random module for parameter sampling, making the drawn parameter sets reproducible (see the sketch below).
  • simulation_timeout is the maximum delay the server tolerates between two consecutive messages from a client. If this delay is exceeded, it may indicate a client failure.
  • crashes_before_redraw sets the failure threshold for a client. If the number of failures exceeds this limit, the sampled parameters are presumed to have caused instabilities in the solver, and new parameters are drawn for the failed clients.
  • verbosity is specific to the server logs.
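
To illustrate the role of seed, a seeded generator from numpy.random makes the drawn parameter sets reproducible across runs; the uniform distribution and bounds below are arbitrary stand-ins for the study-specific sampling.

import numpy as np

# Reproducible sampling: the same seed yields the same parameter sets.
rng = np.random.default_rng(123)

# Draw `nb_parameters` values for each of the `parameters_sweep_size`
# simulations (distribution and bounds are placeholders here).
parameters = rng.uniform(low=0.0, high=1.0, size=(1000, 2))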

server_config

This is a relatively simple option to set up. When the server job is submitted, it runs the following bash script.

#!/bin/sh
set -x

# Melissa will paste the `preprocessing_instructions`
# here

# the remainder of this file should be left untouched. 
# melissa-launcher will find and replace values in 
# curly brackets (e.g. {melissa_set_env_file}) with 
# the proper values.
. {melissa_set_env_file}

echo "DATE                      =$(date)"
echo "Hostname                  =$(hostname -s)"
echo "Working directory         =$(pwd)"
echo ""
echo $PYTHONPATH

set -e

exec melissa-server --project_dir {path_to_usecase} --config_name {config_name}

If users need to perform some preprocessing before executing melissa-server, they can add preprocessing commands as follows:

{
    "server_config": {
        "preprocessing_commands": [
            "module load openmpi",
            "export MY_ENV=..."
        ]
    }
}
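
To make the substitution step concrete, here is a purely illustrative sketch of how a list of preprocessing_commands could be spliced into the script template above; the real rendering is handled by melissa-launcher.

# Illustration only: splicing preprocessing commands into the
# server script template shown above.
preprocessing_commands = [
    "module load openmpi",
    "export MY_ENV=...",
]

template = (
    "#!/bin/sh\n"
    "set -x\n\n"
    "{preprocessing_commands}\n\n"
    ". {melissa_set_env_file}\n"
)
script = template.format(
    preprocessing_commands="\n".join(preprocessing_commands),
    melissa_set_env_file="/path/to/melissa_set_env.sh",  # placeholder path
)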

client_config

Similar to server_config, clients can also execute preprocessing commands before launching. Additionally, users are expected to provide the client executable (a script or binary that runs the solver):

{
    "client_config": {
        "executable_command": "python3 path/to/the/solver.py",
        "command_default_args": ["--nx=100", "--ny=100"],
        "preprocessing_commands": [
            "module load openmpi",
            "export MY_ENV=..."
        ]
    }
}
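
Conceptually, the launcher combines executable_command with command_default_args to build each client's launch line; the sketch below assumes, for illustration only, that the sampled input parameters are appended as trailing arguments, while the exact mechanism is handled internally by Melissa.

# Illustration only: assembling a client launch line.
executable_command = "python3 path/to/the/solver.py"
command_default_args = ["--nx=100", "--ny=100"]
sampled_parameters = [0.42, 0.73]  # example values drawn by the server

launch_line = " ".join(
    [executable_command, *command_default_args, *map(str, sampled_parameters)]
)
# -> "python3 path/to/the/solver.py --nx=100 --ny=100 0.42 0.73"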

The following options are expected to change based on the study type as well as the job scheduler being used:

launcher_config

We recommend reading the Scheduler Support guide for an easy setup of job scheduler options.

Beyond scheduling, users must decide whether to enable fault_tolerance, which resubmits a job upon failure using checkpoints and is detailed in the Fault Tolerance guide. The job_limit parameter determines the maximum number of simultaneous jobs submitted to the scheduler.

The timer_delay setting controls how often (in seconds) the launcher checks the status of submitted jobs by querying the job scheduler.

A shorter delay means more frequent status checks, allowing the system to detect available job slots faster and submit new jobs sooner. Conversely, a longer delay reduces the frequency of checks, potentially slowing down job submissions.
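
The effect of timer_delay can be pictured as a simple polling loop; this is a conceptual sketch, not the actual launcher code, and both callables are placeholders.

import time

TIMER_DELAY = 10  # seconds, mirroring "timer_delay"

def monitor_jobs(query_job_states, submit_pending_jobs):
    # Conceptual sketch: periodically query the scheduler, then fill
    # any freed job slots (bounded by "job_limit").
    while True:
        query_job_states()
        submit_pending_jobs()
        time.sleep(TIMER_DELAY)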

Warning

  • The server job is also managed by the launcher. As a result, only job_limit - 1 client jobs will run simultaneously.
  • timer_delay can be considered a hyperparameter for deep-learning studies using the Reservoir buffer, which retains samples for as long as possible. If the timer_delay is too long, it may slow down job submissions, increasing the likelihood of repeating the same samples.
  • verbosity is specific to the launcher logs.

{
    "launcher_config": {
        "scheduler": "openmpi",
        "scheduler_arg_client": ["-n", "1","--timeout", "60"],
        "scheduler_arg_server": ["-n", "1","--timeout", "3600"],
        "fault_tolerance": false,
        "job_limit": 2,
        "timer_delay": 10,
        "verbosity": 2
    }
}

Study-specific options

sa_config for Sensitivity Analysis

This option may only require setting the checkpoint_interval, which instructs the Melissa server to create a checkpoint after receiving the specified number of samples. The checkpoint includes computed statistics and metadata about the study's current state.

{
    "sa_config": {
        "mean": true,
        "variance": true,
        "skewness": true,
        "kurtosis": true,
        "sobol_indices": true,
        "checkpoint_interval": 50
    }
}
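
The checkpoint trigger amounts to a simple counter over received samples; this is a conceptual sketch, where checkpoint is a placeholder for the server's checkpointing routine.

CHECKPOINT_INTERVAL = 50  # mirrors "checkpoint_interval"

def on_sample_received(state, checkpoint):
    # Conceptual sketch: checkpoint statistics and study metadata
    # every CHECKPOINT_INTERVAL received samples.
    state["received"] += 1
    if state["received"] % CHECKPOINT_INTERVAL == 0:
        checkpoint()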

dl_config for Deep-learning

We recommend reading the Reservoir section to better understand per_server_watermark.

nb_batches_update specifies the number of batches to process before running validation. dl_config also accepts a checkpoint_interval option, which controls how many batches are processed between checkpoints of the model and user-defined components; it defaults to the value of nb_batches_update.

Note

Users are free to modify this dictionary however they want; it is made available through the self.dl_config attribute of the server class instance.

{
    "dl_config": {
        "nb_batches_update": 70,
        "batch_size": 10,
        "per_server_watermark": 100,
        "buffer_size": 200,
        "buffer": "FIFO",
        "convert_log_to_df": false,
        // Following are user-defined
        "validation_input_filename": "/path/to/melissa/examples/heat-pde/heat-pde-dl/offline/sc2023-heatpde-validation/input.npy",
        "validation_target_filename": "/path/to/melissa/examples/heat-pde/heat-pde-dl/offline/sc2023-heatpde-validation/target.npy"
    }
}
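
Since the dictionary above is exposed as self.dl_config, user-defined entries such as the validation file paths can be read back inside the server class; the class skeleton and method name below are illustrative.

import numpy as np

class YourServerClass:  # in practice this inherits a Melissa DL server class
    def load_validation_data(self):
        # `self.dl_config` holds the "dl_config" dictionary from the JSON file.
        inputs = np.load(self.dl_config["validation_input_filename"])
        targets = np.load(self.dl_config["validation_target_filename"])
        return inputs, targets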