Creating a Study Configuration¶
Creating a configuration file for Melissa can seem confusing due to the numerous options and detailed settings. However, by breaking it down step by step, we can make the process more manageable.
The JSON configuration file serves as the backbone of a Melissa study, defining key aspects such as simulation and parameter setup, server setup, and job scheduling. Each section plays a crucial role in ensuring smooth execution and efficient resource management.
Since the JSON file is loaded as a Python dict by the running server instance, users can modify it dynamically. However, certain options require specific keys to be properly defined for the configuration to function correctly.
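As a quick sketch of this behavior (the configuration fragment below is illustrative, not a complete Melissa study file):

```python
import json

# Sketch only: a Melissa configuration is plain JSON, which Python
# parses into an ordinary dict that can be inspected or modified.
config_text = """
{
    "study_options": {
        "parameters_sweep_size": 1000,
        "nb_parameters": 2
    }
}
"""
config = json.loads(config_text)

# Dynamic modification: halve the sweep size before the study starts.
config["study_options"]["parameters_sweep_size"] //= 2
print(config["study_options"]["parameters_sweep_size"])  # prints 500
```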
Important
We only cover the minimal aspects of the configuration needed to run a simple study. To understand all options in detail, run melissa-launcher --print-options.
Top-level JSON dictionary keys of the configuration¶
In any configuration, some options remain constant, while others vary depending on the study type (SA or DL) and the chosen job scheduler.
server_filename & server_class¶
Specify the Python script that defines the YourServerClass class, which must inherit one of Melissa's predefined server classes according to the type of study.
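For instance, these two keys might look as follows (the file name is a placeholder, and YourServerClass stands in for the user's own class):

```json
{
    "server_filename": "my_server.py",
    "server_class": "YourServerClass"
}
```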
study_options¶
Specify the number of simulations to run, the number of input parameters to sample for each simulation, the fields to be received by the server (the same field names that were registered with the melissa_init and melissa_send calls when instrumenting the solver), and the number of time steps to be sent, as shown:
{
    "study_options": {
        "parameters_sweep_size": 1000,
        "nb_parameters": 2,
        "field_names": ["field_1", "field_2"],
        "nb_time_steps": 100,
        // Following are optional
        "seed": 123,
        "simulation_timeout": 60,
        "crashes_before_redraw": 1000,
        "verbosity": 2
    }
}
Note
- seed is recommended; it is used by the numpy.random module for parameter sampling.
- simulation_timeout is the maximum delay the server assumes between two messages received from a client. If it is exceeded, this may indicate a client failure.
- crashes_before_redraw sets the failure threshold per client. If the number of failures exceeds this limit, the sampled parameters have presumably caused instabilities in the solver, so the parameters are redrawn for the failed clients.
- verbosity is specific to the server logs.
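To illustrate why a fixed seed matters, here is a small sketch of reproducible sampling with numpy.random (this is not Melissa's internal sampling code; the distribution and bounds are arbitrary):

```python
import numpy as np

# Sketch only: two generators created with the same seed draw
# identical parameter sets, making a study reproducible run-to-run.
rng_a = np.random.default_rng(123)
rng_b = np.random.default_rng(123)

# Draw nb_parameters = 2 values, one set per simulation.
sample_a = rng_a.uniform(low=0.0, high=1.0, size=2)
sample_b = rng_b.uniform(low=0.0, high=1.0, size=2)

print(np.allclose(sample_a, sample_b))  # prints True
```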
server_config¶
This is a relatively simple option to set up. When the server job is submitted, it runs the following bash script.
#!/bin/sh
set -x
# Melissa will paste the `preprocessing_instructions`
# here
# the remainder of this file should be left untouched.
# melissa-launcher will find and replace values in
# curly brackets (e.g. {melissa_set_env_file}) with
# the proper values.
. {melissa_set_env_file}
echo "DATE =$(date)"
echo "Hostname =$(hostname -s)"
echo "Working directory =$(pwd)"
echo ""
echo $PYTHONPATH
set -e
exec melissa-server --project_dir {path_to_usecase} --config_name {config_name}
If users need to perform some preprocessing before executing melissa-server, they can add preprocessing commands as follows:
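A possible sketch, assuming server_config accepts a preprocessing_commands list like the one shown for client_config (verify the exact key name with melissa-launcher --print-options):

```json
{
    "server_config": {
        "preprocessing_commands": [
            "module load openmpi",
            "export MY_ENV=..."
        ]
    }
}
```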
client_config¶
Similar to server_config, clients can also execute preprocessing commands before launching. Additionally, users are expected to provide the client executable (a script or binary that runs the solver):
{
    "client_config": {
        "executable_command": "python3 path/to/the/solver.py",
        "command_default_args": ["--nx=100", "--ny=100"],
        "preprocessing_commands": [
            "module load openmpi",
            "export MY_ENV=..."
        ]
    }
}
The following options are expected to change based on the study type as well as the job scheduler being used:
launcher_config¶
We recommend reading the Scheduler Support guide for an easy setup of job scheduler options.
Beyond scheduling, users must decide whether to enable fault_tolerance, which resubmits a job upon failure using checkpoints, as detailed in the Fault Tolerance guide. The job_limit parameter determines the maximum number of jobs submitted to the scheduler simultaneously.
The timer_delay setting controls how often (in seconds) the launcher queries the job scheduler for the status of submitted jobs.
A shorter delay means more frequent status checks, allowing the system to detect available job slots faster and submit new jobs sooner. Conversely, a longer delay reduces the frequency of checks, potentially slowing down job submissions.
Warning
- The server job is also managed by the launcher. As a result, only job_limit - 1 client jobs will run simultaneously.
- timer_delay can be considered a hyperparameter for deep-learning studies using the Reservoir buffer, which retains samples for as long as possible. If timer_delay is too long, it may slow down job submissions, increasing the likelihood of repeating the same samples.
- verbosity is specific to the launcher logs.
{
    "launcher_config": {
        "scheduler": "openmpi",
        "scheduler_arg_client": ["-n", "1", "--timeout", "60"],
        "scheduler_arg_server": ["-n", "1", "--timeout", "3600"],
        "fault_tolerance": false,
        "job_limit": 2,
        "timer_delay": 10,
        "verbosity": 2
    }
}
Study-specific options¶
sa_config for Sensitivity Analysis¶
This option often only requires setting checkpoint_interval, which instructs the Melissa server to create a checkpoint after receiving the specified number of samples. The checkpoint includes the computed statistics and metadata about the study's current state. The boolean flags select which statistics to compute:
{
    "sa_config": {
        "mean": true,
        "variance": true,
        "skewness": true,
        "kurtosis": true,
        "sobol_indices": true,
        "checkpoint_interval": 50
    }
}
dl_config for Deep-learning¶
We recommend reading the Reservoir section to better understand per_server_watermark.
nb_batches_update specifies the number of batches to process before running validation. By default, dl_config also includes a checkpoint_interval option, which controls when to checkpoint the model and user-defined components after processing a set number of batches. This interval defaults to the value of nb_batches_update.
Note
Users are free to modify this option however they want, and the dictionary is made available through the self.dl_config attribute of the server class instance.
{
    "dl_config": {
        "nb_batches_update": 70,
        "batch_size": 10,
        "per_server_watermark": 100,
        "buffer_size": 200,
        "buffer": "FIFO",
        "convert_log_to_df": false,
        // Following are user-defined
        "validation_input_filename": "/path/to/melissa/examples/heat-pde/heat-pde-dl/offline/sc2023-heatpde-validation/input.npy",
        "validation_target_filename": "/path/to/melissa/examples/heat-pde/heat-pde-dl/offline/sc2023-heatpde-validation/target.npy"
    }
}
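To illustrate the note above, here is a schematic sketch of accessing user-defined keys through self.dl_config (MyServer and its constructor are stand-ins, not Melissa's real class hierarchy):

```python
# Schematic stand-in: Melissa's real server base class provides
# self.dl_config; here a plain class mimics that behavior.
class MyServer:
    def __init__(self, dl_config):
        # In Melissa, this dict comes from the "dl_config" section
        # of the JSON configuration file.
        self.dl_config = dl_config

    def validation_paths(self):
        # User-defined keys are reachable from any method.
        return (self.dl_config["validation_input_filename"],
                self.dl_config["validation_target_filename"])

server = MyServer({
    "validation_input_filename": "input.npy",
    "validation_target_filename": "target.npy",
})
print(server.validation_paths())  # prints ('input.npy', 'target.npy')
```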