Scheduler Support

Batch schedulers, also known as workload managers, are essential components of supercomputers responsible for:

  1. Allocating access to computing resources for a specified duration.
  2. Initiating, executing, and monitoring jobs on allocated resources.
  3. Managing resource contention by maintaining and prioritizing a queue of pending jobs.

(For more details, see Wikipedia).

From a user's perspective, this translates to submitting, canceling, and monitoring jobs to ensure their proper execution. In Melissa, scheduler support involves defining rules for job submission, cancellation, and state monitoring.

Since different batch schedulers support multiple scheduling techniques, Melissa treats a scheduler as a scheduling strategy, rather than a specific resource management framework. This approach provides flexibility in adapting to various supercomputer environments.

Scheduling Categories

To support various scheduling strategies, scheduling techniques in Melissa are classified into two categories:

  • Direct schedulers handle job management directly through Python's subprocess mechanisms.
  • Indirect schedulers interact with the batch scheduler through formal requests (dedicated submission, monitoring, and cancellation commands).

The table below provides examples of this classification:

Scheduler   Category   Script Submission Command                                   Job Monitoring
Slurm       Indirect   sbatch <some-script>.sh                                     sacct --job=<job-id>
OAR         Indirect   oarsub --scanscript <some-script>.sh                        oarstat -j <job-id>
OpenMPI     Direct     mpirun -np <nb-proc> <some-options> -- <some-executable>    returncode of Popen.poll()

To support a specific scheduling strategy, Melissa requires defining functions for job submission, cancellation, and monitoring.
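
Conceptually, a scheduling strategy thus boils down to an object providing these three operations. The interface below is only a hypothetical sketch for illustration; the class and method names are not Melissa's actual API.

from abc import ABC, abstractmethod
from typing import List


class SchedulingStrategy(ABC):
    """Hypothetical interface sketching the operations a scheduling
    strategy must provide (names are illustrative, not Melissa's API)."""

    @abstractmethod
    def submit_job(self, commands: List[str], options: List[str]) -> int:
        """Submit a job and return its unique identifier (UID)."""

    @abstractmethod
    def monitor_job(self, uid: int) -> str:
        """Return the current state of the job (e.g., RUNNING, TERMINATED)."""

    @abstractmethod
    def cancel_job(self, uid: int) -> None:
        """Cancel the job associated with the given UID."""

Each concrete strategy then maps these operations either to scheduler commands (indirect) or to Python subprocess calls (direct), as described in the following sections.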

Note

While OpenMPI is not a traditional scheduler, Melissa treats it as one within its framework due to its role in managing parallel execution.

Indirect Scheduling

Melissa assigns a unique identifier (UID) to each job request submitted to the batch scheduler. Each job state request (e.g., via sacct or oarstat) writes its standard output to a dedicated file (e.g., slurm.<UID>.out or oar.<UID>.out), which the launcher then parses to verify that the request succeeded and to extract the relevant job state information.

While job cancellation requests follow a similar process, job submission is more complex. Submitting a job requires specifying resource requirements, duration, and other configurations. Since this process is highly customizable, Melissa automatically generates batch scripts based on the user’s configuration options and submits them to the scheduler. The launcher then monitors the success of the submission.
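
As an illustration of the indirect workflow with Slurm, the sketch below submits a generated batch script and later queries the job state, redirecting each command's output to a per-UID file that is then parsed. The function names, file naming, and parsing logic are illustrative assumptions, not Melissa's internal implementation.

import subprocess
from pathlib import Path


def submit_batch_script(script: Path, uid: str) -> str:
    """Submit a generated batch script with sbatch and parse the job id
    from the redirected output file (illustrative sketch only)."""
    out_file = Path(f"slurm.{uid}.out")
    with out_file.open("w") as f:
        subprocess.run(["sbatch", str(script)],
                       stdout=f, stderr=subprocess.STDOUT, check=False)
    # On success, sbatch prints e.g. "Submitted batch job 123456"
    text = out_file.read_text().strip()
    return text.split()[-1] if text.startswith("Submitted") else ""


def query_job_state(job_id: str, uid: str) -> str:
    """Query the job state with sacct and parse the per-UID output file."""
    out_file = Path(f"slurm.{uid}.out")
    with out_file.open("w") as f:
        subprocess.run(["sacct", f"--job={job_id}", "--format=State",
                        "--noheader", "-X"],
                       stdout=f, stderr=subprocess.STDOUT, check=False)
    text = out_file.read_text().strip()
    return text.split()[0] if text else "UNKNOWN"

In Melissa, the actual commands, batch scripts, and output file names are generated from the launcher configuration; the key point is that every interaction goes through the scheduler's command-line interface and the files it produces.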

Key Implications of Indirect Scheduling

  • File Overload Risk: Frequent interactions generate many output files, potentially leading to inode saturation on the cluster.
  • Manual Debugging Support: sbatch scripts are saved, allowing users to manually relaunch them for debugging.
  • Performance Dependency: Execution speed depends on cluster occupancy and queue times.
  • Reproducibility Challenges: For deep surrogate training, certain buffer types may prevent exact reproducibility.

Direct Scheduling

Direct scheduling in Melissa relies on Python subprocesses instead of batch schedulers. Similar to how batch scripts are used in indirect scheduling, submission commands are generated based on configuration options, and jobs are directly spawned as subprocesses.

Unlike indirect scheduling, no scheduler-specific commands are required for job monitoring or termination. Instead:
  • The subprocess poll function checks job status using the process's returncode.
  • The kill function is used to terminate jobs when necessary.
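
The sketch below illustrates this mechanism with a hypothetical mpirun command line; in Melissa, the actual command is generated from the configuration options.

import subprocess

# Illustrative command line only; Melissa builds it from the configuration.
command = ["mpirun", "-np", "4", "--", "./some-executable"]
job = subprocess.Popen(command)

# Monitoring: poll() returns None while the job is running and the
# returncode once it has terminated.
state = job.poll()
if state is None:
    print("job is still running")
elif state == 0:
    print("job completed successfully")
else:
    print(f"job failed with returncode {state}")

# Cancellation: kill() terminates the job if it is still running.
if job.poll() is None:
    job.kill()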

Direct scheduling strategies can be adapted to work with any workload manager. For example, both Slurm and OAR can operate in a mode where job handling is done directly within a pre-allocated resource pool. However, in such cases:

  • Proper scheduling applies only to the initial allocation request for the resource pool.
  • The launcher must be carefully configured to ensure jobs are correctly scheduled on available resources, avoiding unnecessary overhead or potential performance degradation.

Key Implications of Direct Scheduling

  • Single Queuing Event: The entire study is queued only once, reducing wait times.
  • Performance Variability: Fixed allocations may lead to suboptimal performance depending on cluster occupancy.
  • Optimized for Deep Surrogates: Direct scheduling ensures continuous data flow and provides near-reproducibility for all buffer types in deep surrogate training.

Melissa Schedulers

Melissa currently supports OAR and Slurm workload managers for supercomputer execution. For local execution, the launcher itself acts as a substitute for a batch scheduler. The following sections provide detailed information on all supported scheduling strategies.

OAR

The basic oar scheduler is indirect. It submits oarsub.<uid>.sh scripts to OAR using the oarsub command and manages job monitoring and cancellation through calls to oarstat and oardel.

Configuration files for OAR are provided for various use cases:

Recent versions of OAR allow users to take advantage of Melissa's elastic and fault-tolerant features through advanced functionalities such as best effort, moldable, and container jobs.

Warning

When using OAR, users must be mindful of the resource hierarchy rules. For example, if the server job requires one host with two GPUs and one core per GPU, the launcher should be configured like this: "scheduler_arg_server": "host=1/gpu=2/core=1".

OAR Hybrid

The oar-hybrid scheduler is also indirect, with a unique feature: it requests a job container capable of running the server along with n1 concurrent client jobs. Every n2 client job submissions are placed in the best-effort queue. Since the job container has its own internal job queue, all job monitoring, cancellation, and submission requests are made to the scheduler.

The container is configured with the following options:

 "launcher_config": {
    "scheduler": "oar-hybrid",
    "container_max_number_of_clients": n1,                         # int
    "besteffort_allocation_frequency": n2,                         # int
    "scheduler_arg_container": "<scheduler-options-for-container>"  # List[str]
 }

An example of an oar-hybrid configuration file is available for heat-pde-sa: config_oar.json.

This scheduling strategy was proposed as part of the REGALE project. Its primary objective is to dynamically adapt Melissa's job submission behavior to the cluster's occupancy level, efficiently filling empty resources.

Slurm

The basic slurm scheduler is indirect. It submits sbatch.<uid>.sh scripts to Slurm with sbatch and handles job monitoring/cancellation via calls to sacct and scancel.

Examples of configuration files for Slurm are available for these use cases:

Note

Check Creating a SLURM-docker-cluster if you want to test Melissa with Slurm locally.

Slurm OpenMPI

The slurm-openmpi scheduler is indirect. It submits sbatch.<uid>.sh scripts to Slurm using the sbatch command and handles job monitoring and cancellation through calls to sacct and scancel.

By replacing the srun heterogeneous syntax with the mpirun MPMD syntax (e.g., mpirun -n <n> <exec-1> : -n <n> <exec-2>), this scheduler enables launching groups of size greater than one when srun heterogeneous job submission is not supported.

Warning

To ensure full flexibility, the total number of tasks is not derived automatically. It is the user's responsibility to ensure consistency between group_size, #SBATCH --ntasks, and mpirun -n.

An example of a configuration file for slurm-openmpi is available for heat-pde-sa: config_slurm.json.

Slurm Global

The slurm-global scheduler is direct. It first requests a heterogeneous pre-allocation for both the server and the clients. Jobs are then submitted to their dedicated partition using srun --het-group=<het-group-id> via Python subprocesses. State monitoring and cancellation are handled directly.

To request such a heterogeneous allocation, the following command is used:

sbatch study_g.sh

For a CPU & GPU allocation, the study_g.sh script would look as follows:

#!/bin/sh

#SBATCH --job-name=global
#SBATCH --time=HH:MM:SS

# CPU options (default het-group)
#SBATCH --nodes=X
#SBATCH --ntasks-per-node=XX

# GPU options (het-group 1)
#SBATCH hetjob
#SBATCH --nodes=Y
#SBATCH --ntasks-per-node=YY
#SBATCH --gres=gpu:YY

exec melissa-launcher --config_name /path/to/use-case/config_<scheduler>

The launcher configuration should include the following options:

{
    "launcher_config": {
        "scheduler_arg_client": ["--ntasks=xx", "--het-group=0", "--time=HH:MM:SS"],
        "scheduler_arg_server": ["--ntasks=yy", "--het-group=1", "--time=HH:MM:SS"]
    }
}

Where xx and yy represent fractions of the total numbers of pre-allocated tasks for the clients (X × XX) and for the server (Y × YY), respectively.

In this configuration, the launcher does not run on the frontend node.

Warning

Some clusters do not support partition-based heterogeneous submissions. In such cases, the slurm-semiglobal strategy should be preferred.

Slurm Semi-Global

The slurm-semiglobal scheduler is hybrid, meaning that the server is treated indirectly, while the clients are treated directly. It first requests a pre-allocation for the launcher and the clients.

Once the pre-allocation request is satisfied, the launcher runs within it, submits the server job to the batch scheduler, and executes client jobs under the same allocation.

For this scheduling strategy, the launcher must be started as the main process of the client allocation:

sbatch study_sg.sh

Where, for a CPU allocation, study_sg.sh would look as follows:

#!/bin/sh

#SBATCH --job-name=semi-global
#SBATCH --output=melissa-launcher.out
#SBATCH --time=HH:MM:SS
#SBATCH --nodes=X
#SBATCH --ntasks-per-node=XX

exec melissa-launcher --config_name /path/to/use-case/config_<scheduler>

The launcher configuration should include the following options:

{
    "launcher_config": {
        "scheduler_arg_client": ["--ntasks=xx", "--time=HH:MM:SS", "--exclusive"]
    }
}

Where xx represents a fraction of the total number of pre-allocated tasks for the clients (X × XX).

An example of the configuration file for slurm-semiglobal is available for heat-pde-dl: config_slurm.json.

Note

In this configuration, the server job state is monitored via life signals only (i.e., via PING messages).

Warning

The walltime of the pre-allocated job must be longer than the server job's walltime: since both jobs are exclusive, the server job always starts after the pre-allocation begins. If the pre-allocated job's walltime is too short, the server job may not have enough time to complete its execution.

OpenMPI

The basic openmpi scheduler is direct and was initially designed for local execution.

Examples of configuration files for OpenMPI are available for all use cases:

With this scheduling strategy, optimal performance is achieved when job_limit and timer_delay are correctly configured. However, if the user wishes to overload the processing elements, the --oversubscribe option can be used.

Important Considerations

  • The launcher counts the server job and the client jobs against the same job_limit. Consequently, at most job_limit - 1 clients run simultaneously, leaving room for the server job.
  • The timer_delay defines how often the launcher checks job statuses with the batch scheduler, and it can significantly impact performance. For short-lived jobs, frequent checks (i.e., a small timer_delay) ensure that new jobs are submitted quickly and keep the number of running jobs close to the job_limit. Checking too frequently, however, may overwhelm the batch scheduler with requests and lengthen processing times, particularly on supercomputers with heavy user traffic.