Scheduler support

Batch schedulers, also referred to as workload managers, are supercomputer components designed to handle three main tasks (see Wikipedia):

  1. allocating users access to resources for some duration,
  2. starting/executing/monitoring jobs allocated on said resources,
  3. arbitrating contention for resources by managing a queue of pending jobs.

From the user's perspective, this boils down to submitting/cancelling jobs and monitoring their proper execution. With Melissa, scheduler support hence means defining rules to perform job submission/cancellation and job state monitoring. For a given batch scheduler, however, several scheduling techniques exist, which is why, in the sense of Melissa, a scheduler denotes a scheduling strategy rather than a resource management framework per se.

Scheduling categories

To facilitate the support of any scheduling strategy, scheduling techniques were originally divided into two categories: direct and indirect schedulers. For indirect schedulers, interaction with the workload manager happens through formal requests (command-line submissions whose outputs are parsed), while for direct schedulers, job management is handled with Python subprocess mechanisms. Examples of this classification are given in the table below:

Scheduler | Category | Script submission command                               | Job monitoring command
----------|----------|---------------------------------------------------------|-----------------------
slurm     | Indirect | sbatch <some-script>.sh                                 | sacct --job=<job-id>
OAR       | Indirect | oarsub --scanscript <some-script>.sh                    | oarstat -j <job-id>
OpenMPI   | Direct   | mpirun -np <nb-proc> --<some-options> <some-executable> | -

From there on, supporting a given scheduling strategy requires defining the appropriate functions to submit, cancel and monitor jobs.
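These three operations can be sketched as a common interface. The class and method names below are hypothetical, not Melissa's actual API:

```python
from abc import ABC, abstractmethod

# Hypothetical sketch: every scheduling strategy, direct or indirect,
# reduces to the same three operations.
class Scheduler(ABC):
    @abstractmethod
    def submit_job(self, commands, options):
        """Submit a job and return its identifier."""

    @abstractmethod
    def monitor_job(self, job_id):
        """Return the current state of the job."""

    @abstractmethod
    def cancel_job(self, job_id):
        """Request cancellation of the job."""
```

A concrete scheduler then only has to implement these methods with the mechanisms of its category (formal requests or subprocesses).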


Although OpenMPI is not a scheduler in the sense that it is not intended for scheduling tasks, it is considered as such in the context of Melissa.

Indirect scheduling

Melissa assigns a unique identifier (UID) to each request submitted to the batch scheduler. For example, when a job state update is requested (e.g. via calls to sacct or oarstat), the command execution generates standard outputs (e.g. slurm.<UID>.out or oar.<UID>.out) that are subsequently parsed by the launcher. This way, the launcher makes sure that the request succeeded and extracts the targeted information (e.g. the job state) from the generated output.
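The parsing step can be illustrated as follows. This is a sketch only: the real sacct/oarstat output formats and the launcher's parsing rules differ, and the column layout here is an assumption:

```python
# Sketch only: given the captured standard output of a monitoring command
# (e.g. the contents of slurm.<UID>.out after sacct --job=<job-id>),
# extract the state of the targeted job.
def parse_job_state(output, job_id):
    for line in output.splitlines():
        fields = line.split()
        if fields and fields[0] == str(job_id):
            return fields[1]  # hypothetical column holding the job state
    raise ValueError(f"job {job_id} not found in scheduler output")

# Minimal sacct-like output for illustration:
sample = "JobID State\n42 RUNNING\n43 PENDING"
```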

Although requesting a job cancellation is quite similar to requesting a job state update, job submission is more complicated. Indeed, submitting a job requires telling the batch scheduler precisely which resources are needed, for how long, etc. In other words, job submission is a highly configurable request, which is why it makes sense to base it upon batch scripts. With Melissa, the launcher automatically generates such a script from the configuration options and then submits it to the batch scheduler. Again, the success of this request is monitored.
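The script generation step can be sketched as follows; this is a minimal illustration with hypothetical names, the real launcher handles many more options:

```python
# Minimal illustration of how a launcher could assemble a batch script
# from configuration options before submitting it.
def build_sbatch_script(options, command):
    lines = ["#!/bin/bash"]                          # batch scripts need a shebang
    lines += [f"#SBATCH {opt}" for opt in options]   # one directive per option
    lines.append(command)                            # the actual job command
    return "\n".join(lines)

script = build_sbatch_script(["--ntasks=4", "--time=00:10:00"], "srun ./client")
```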

Finally, the main consequences of indirect scheduling are:

  • interactions generate many files which can quickly saturate the inode limit of the cluster,
  • sbatch scripts are saved and can be relaunched manually which significantly facilitates debugging,
  • execution performance strongly depends on the cluster's occupancy,
  • for deep surrogate training, reproducibility is strictly impossible for some kinds of buffers.

Direct scheduling

As explained earlier, this technique relies on Python subprocesses. Similarly to the submission scripts of indirect scheduling, submission commands are built from the configuration options. Processes are then directly spawned from these commands. In addition, no scheduler-specific command is needed to monitor or kill jobs. Instead, the subprocess poll and kill functions are used to monitor the job state via the process returncode value and to kill the job, respectively.
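A minimal sketch of this mechanism, with a trivial sleep command standing in for the real submission command:

```python
import subprocess

# A trivial command stands in for the real one
# (e.g. mpirun -np <nb-proc> <some-executable>).
job = subprocess.Popen(["sleep", "2"])

# poll() returns None while the process runs, its returncode otherwise.
state = "RUNNING" if job.poll() is None else "FINISHED"

# Cancellation needs no scheduler request either:
job.kill()
job.wait()  # reap the process; returncode is negative when killed
```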

In practice, direct scheduling strategies can be derived for any kind of workload manager. For instance, with both Slurm and OAR, jobs can be treated directly if they are submitted, cancelled and monitored within a pre-allocation. In such conditions, however, proper scheduling only happens for the main allocation request. For this reason, the launcher must be adequately parameterized to make sure jobs are submitted correctly, i.e. on available resources. Otherwise, significant overhead and/or performance loss could be induced.

Main consequences of direct scheduling are:

  • the whole study only requires to queue once,
  • depending on the cluster's occupancy, working in a fixed allocation can yield suboptimal performance,
  • for deep surrogate training, direct scheduling guarantees a constant incoming data flow and near-reproducibility for all buffers.

Melissa schedulers

For supercomputer execution, Melissa currently supports only the OAR and Slurm workload managers, while for local execution, the launcher acts as its own batch scheduler. All supported scheduling strategies are detailed in the following sections.


OAR

The basic oar scheduler is indirect. It submits oarsub.<uid>.sh scripts to OAR with oarsub and handles job monitoring/cancellation via calls to oarstat and oardel.

Examples of configuration files for OAR are available for all use cases.

Recent versions of OAR make it possible to take advantage of the elastic and fault-tolerant features of Melissa via advanced functionalities such as best-effort, moldable and container jobs.


With OAR, the user must pay attention to the resource hierarchy rules. For instance, if the server job needs one host with two GPUs and one core per GPU, the launcher should be configured as follows: "scheduler_arg_server": "host=1/gpu=2/core=1".

OAR hybrid

The oar-hybrid scheduler is also indirect. Its particularity is to request a job container capable of running the server and n1 concurrent clients. One client job out of every n2 submissions is submitted to the besteffort queue. Since the job container has its own inner job queue, job monitoring, cancellation and submission requests are made to the scheduler.

The container is parametrized with the following options:

 "launcher_config": {
    "scheduler": "oar-hybrid",
    "container_max_number_of_clients": n1,                           # int
    "besteffort_allocation_frequency": n2,                           # int
    "scheduler_arg_container": ["<scheduler-options-for-container>"] # List[str]
 }

An example of configuration file for oar-hybrid is available for heat-pde-sa: config_oar.json.

This scheduling strategy was proposed within the framework of the REGALE project. Its main objective is to naturally adapt Melissa's job submission behavior to the cluster's level of occupancy by dynamically filling empty resources.


Slurm

The basic slurm scheduler is indirect. It submits sbatch.<uid>.sh scripts to Slurm with sbatch and handles job monitoring/cancellation via calls to sacct and scancel.

Examples of configuration files for Slurm are available for these use cases.


A virtual cluster equipped with Slurm can be built locally by following the Virtual Cluster tutorial.

Slurm OpenMPI

The slurm-openmpi scheduler is indirect. It submits sbatch.<uid>.sh scripts to Slurm with sbatch and handles job monitoring/cancellation via calls to sacct and scancel.

By substituting the mpirun MPMD syntax for the srun heterogeneous syntax, this scheduler makes it possible to launch groups of size greater than one when srun heterogeneous job submission is not supported.
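The substitution can be illustrated as follows; this is a hypothetical sketch of the MPMD command construction, not Melissa's actual code:

```python
# A group of several members is expressed with the mpirun MPMD syntax,
# one "app context" per member separated by ':', instead of one srun
# heterogeneous submission.
def mpirun_mpmd(nb_proc, executable, group_size):
    member = f"-np {nb_proc} {executable}"
    return "mpirun " + " : ".join([member] * group_size)

cmd = mpirun_mpmd(4, "./client", 2)
# yields: mpirun -np 4 ./client : -np 4 ./client
```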


To ensure full flexibility, the total number of tasks is not derived automatically. Making sure group_size, #SBATCH --ntasks and mpirun -n are consistent is the user's responsibility.

An example of configuration file for slurm-openmpi is available for heat-pde-sa: config_slurm.json.

Slurm global

The slurm-global scheduler is direct. It first requests a heterogeneous pre-allocation for the server and the clients. Jobs are then submitted to their dedicated partition via srun --het-group=<het-group-id> based Python subprocesses. State monitoring and cancellation are performed directly.

Such a heterogeneous allocation must be requested via a batch script which, for a CPU & GPU allocation, would look as follows:

#!/bin/bash
#SBATCH --job-name=global
# CPU options (default het-group)
#SBATCH --nodes=X
#SBATCH --ntasks-per-node=XX
#SBATCH hetjob
# GPU options (het-group 1)
#SBATCH --nodes=Y
#SBATCH --ntasks-per-node=YY
#SBATCH --gres=gpu:YY

exec melissa-launcher --config_name /path/to/use-case/config_<scheduler>

The launcher configuration should include the following options:

"scheduler_arg_client": ["--ntasks=xx", "--het-group=0", "--time=HH:MM:SS"],
"scheduler_arg_server": ["--ntasks=yy", "--het-group=1", "--time=HH:MM:SS"]

where xx and yy are fractions of XX and YY respectively.


In this configuration, the launcher does not run on the frontend node.


Some clusters do not support partition-based heterogeneous submissions. In this case, the slurm-semiglobal strategy should be preferred.

Slurm semi-global

The slurm-semiglobal scheduler is hybrid, which means that the server is treated indirectly while the clients are treated directly. It first requests a pre-allocation for the launcher and the clients. Jobs are then submitted either to the batch scheduler (for the server) or directly inside the allocation (for the clients).

For this scheduling strategy, the launcher must be started as the main process of the client allocation via a batch script which, for a CPU allocation, would look as follows:

#!/bin/bash
#SBATCH --job-name=semi-global
#SBATCH --output=melissa-launcher.out
#SBATCH --nodes=X
#SBATCH --ntasks-per-node=XX

exec melissa-launcher --config_name /path/to/use-case/config_<scheduler>

The launcher configuration should include the following options:

"scheduler_arg_client": ["--ntasks=xx", "--time=HH:MM:SS", "--exclusive"],
"scheduler_arg_server": ["--ntasks=yy", "--time=HH:MM:SS"],
"job_limit": N,
"timer_delay": T

where job_limit corresponds to the total number of client jobs the allocation can hold at the same time and xx is a fraction of XX.
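The interplay between job_limit and timer_delay can be sketched as follows. This is an illustrative loop only, not Melissa's implementation; poll and submit are hypothetical callbacks:

```python
import time

# poll(job) is assumed to return None while the job runs; submit(spec)
# is assumed to start a job inside the allocation and return a handle.
def fill_allocation(pending, running, job_limit, timer_delay, poll, submit):
    while pending or running:
        # drop jobs that finished since the last check
        running = [job for job in running if poll(job) is None]
        # top the allocation up to at most job_limit concurrent jobs
        while pending and len(running) < job_limit:
            running.append(submit(pending.pop(0)))
        time.sleep(timer_delay)
```

A too-large timer_delay leaves resources idle between checks, while a too-small job_limit underfills the allocation, hence the performance sensitivity to both options.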

An example of configuration file for slurm-semiglobal is available for heat-pde-dl: config_slurm.json.


In this configuration, the server job state is monitored via life signal only (i.e. via PING messages).


OpenMPI

The basic openmpi scheduler is direct and was originally designed for local execution.

Examples of configuration files for OpenMPI are available for all use cases.

With this scheduling strategy, optimal performance is guaranteed only if job_limit and timer_delay are properly set. Conversely, if the user wants to overload their processing elements, the --oversubscribe option can be used.