Scheduler support¶
Batch schedulers, also referred to as workload managers, are supercomputer components designed to handle three main tasks (see Wikipedia):
- allocating access to resources to users for some duration,
- starting/executing/monitoring jobs allocated on said resources,
- arbitrating contention for resources by managing a queue of pending jobs.
From the user's perspective, this boils down to submitting/cancelling jobs and monitoring their execution. With Melissa, scheduler support hence means defining rules to perform job submission/cancellation and job state monitoring. For a given batch scheduler, however, several scheduling techniques exist, which is why, in the sense of Melissa, a scheduler denotes a scheduling strategy rather than a resource management framework per se.
Scheduling categories¶
To facilitate the support of any scheduling strategy, scheduling techniques were originally divided into two categories: indirect and direct schedulers. With the former, interaction with the batch scheduler happens through formal requests whose outputs are parsed, while with the latter, job management is handled with Python subprocess mechanisms. Examples of this classification are given in the table below:
Scheduler | Category | Script submission command | Job monitoring command
---|---|---|---
slurm | Indirect | `sbatch <some-script>.sh` | `sacct --job=<job-id>`
OAR | Indirect | `oarsub --scanscript <some-script>.sh` | `oarstat -j <job-id>`
OpenMPI | Direct | `mpirun -np <nb-proc> --<some-options> <some-executable>` | -
From there on, supporting a given scheduling strategy requires defining the appropriate functions to submit, cancel and monitor jobs (a minimal sketch is given after the note below).
Note
Although OpenMPI is not a scheduler per se (it is not intended for scheduling tasks), it is considered as such in the context of Melissa.
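In practice, each strategy boils down to implementing three operations. The sketch below illustrates this with a minimal, hypothetical interface (the class and method names are illustrative, not Melissa's actual API):

```python
from abc import ABC, abstractmethod


class Scheduler(ABC):
    """Minimal interface a scheduling strategy has to provide."""

    @abstractmethod
    def submit_job(self, uid: int, options: list[str], command: str) -> int:
        """Build a job from the configuration options, submit it, return its id."""

    @abstractmethod
    def monitor_job(self, job_id: int) -> str:
        """Return the current state of the job (e.g. PENDING, RUNNING, FAILED)."""

    @abstractmethod
    def cancel_job(self, job_id: int) -> None:
        """Cancel the job."""
```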
Indirect scheduling¶
Melissa assigns a unique identifier number (`UID`) to each request submitted to the batch scheduler. For example, when a job state update is requested (e.g. calls to `sacct` or `oarstat`), the command execution generates standard outputs (e.g. `slurm.<UID>.out` or `oar.<UID>.out`) that are parsed by the launcher afterwards. This way, the launcher makes sure that the request succeeded and extracts the targeted information (e.g. the job state) from the generated output.
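As an illustration, the monitoring side of an indirect scheduler could be sketched as follows (the command, file naming and parsing are simplified assumptions, not Melissa's actual implementation):

```python
import subprocess


def query_job_state(job_id: int, uid: int) -> str:
    """Ask Slurm for the state of a job and parse the redirected output."""
    out_file = f"slurm.{uid}.out"
    # Run the monitoring command and redirect its standard output to a file,
    # one file per request UID.
    with open(out_file, "w") as out:
        subprocess.run(
            ["sacct", f"--job={job_id}", "--format=State", "--noheader", "-X"],
            stdout=out,
            check=True,
        )
    # Parse the generated file to extract the job state (e.g. RUNNING, COMPLETED).
    with open(out_file) as out:
        fields = out.read().split()
    return fields[0] if fields else "UNKNOWN"
```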
Although requesting a job cancellation is quite similar to requesting a job state update, job submission is more complicated. Indeed, submitting a job requires telling the batch scheduler precisely which resources are needed, for how long, etc. In other words, job submission is a highly configurable request, which is why it makes sense to base it on batch scripts. With Melissa, the launcher automatically generates such a script from the configuration options and then submits it to the batch scheduler. Again, the success of this request is monitored.
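The submission step could likewise be sketched as follows (the script content, naming convention and use of `--parsable` are assumptions for illustration only):

```python
import subprocess


def submit_job(uid: int, options: list[str], command: str) -> int:
    """Generate a batch script from configuration options and submit it."""
    script = f"sbatch.{uid}.sh"
    with open(script, "w") as f:
        f.write("#!/bin/sh\n")
        # Configuration options become #SBATCH directives.
        for opt in options:
            f.write(f"#SBATCH {opt}\n")
        f.write(command + "\n")
    # --parsable makes sbatch print only the job id, which keeps parsing trivial.
    result = subprocess.run(
        ["sbatch", "--parsable", script], capture_output=True, text=True, check=True
    )
    return int(result.stdout.strip().split(";")[0])
```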
Finally, the main consequences of indirect scheduling are:
- interactions generate many files which can quickly saturate the inode limit of the cluster,
- sbatch scripts are saved and can be relaunched manually, which significantly facilitates debugging,
- execution performance strongly depends on the cluster's occupancy,
- for deep surrogate training, reproducibility is strictly impossible for some kinds of buffers.
Direct scheduling¶
As explained earlier, this technique relies on Python subprocesses. Similarly to the submission scripts of indirect scheduling, submission commands are built from the configuration options. Processes are then directly spawned from such commands. In addition, no scheduler-specific command is needed to monitor or kill jobs: the subprocess `poll` and `kill` functions are used to monitor the job state via the process `returncode` value and to kill the job, respectively.
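For illustration, the whole life cycle of a directly scheduled job fits in a few lines (the command line is a placeholder):

```python
import subprocess

# Spawn the job directly from the command built out of the configuration options.
job = subprocess.Popen(["mpirun", "-np", "4", "./some-executable"])

# Monitoring: poll() returns None while the process is running,
# and sets returncode once it has terminated.
if job.poll() is None:
    print("job is still running")
else:
    print(f"job finished with return code {job.returncode}")

# Cancellation: the process is killed directly, no scheduler command involved.
job.kill()
```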
In practice, direct scheduling strategies can be derived for any kind of workload manager. For instance, with both Slurm and OAR, jobs can be treated directly if they are submitted, cancelled or monitored within a pre-allocation. In such conditions, however, proper scheduling only happens for the main allocation request. For this reason, the launcher must be adequately parameterized to make sure jobs are submitted correctly, i.e. on available resources. Otherwise, significant overhead and/or performance loss could be induced.
The main consequences of direct scheduling are:
- the whole study only needs to queue once,
- depending on the cluster's occupancy, working in a fixed allocation can yield suboptimal performance,
- for deep surrogate training, direct scheduling guarantees a constant incoming data flow and near-reproducibility for all buffers.
Melissa schedulers¶
For supercomputer execution, Melissa currently only supports the OAR and Slurm workload managers, while for local execution the launcher acts as its own substitute for a batch scheduler. All supported scheduling strategies are detailed in the following sections.
OAR¶
The basic `oar` scheduler is indirect. It submits `oarsub.<uid>.sh` scripts to OAR with `oarsub` and handles job monitoring/cancellation via calls to `oarstat` and `oardel`.
Examples of configuration files for OAR are available for all use cases:
- `heat-pde-sa`: `config_oar.json`,
- `heat-pde-dl`: `config_oar.json`,
- `lorenz`: `config_oar.json`.
Recent versions of OAR make it possible to take advantage of the elastic and fault-tolerant sides of Melissa via advanced functionalities such as best-effort, moldable and container jobs.
Warning
With OAR, the user must pay attention to the resource hierarchy rules. For instance, if the server job needs one `host` with two `gpu`s of one `core` each, the launcher should be configured as follows: `"scheduler_arg_server": "host=1/gpu=2/core=1"`.
OAR hybrid¶
The `oar-hybrid` scheduler is indirect too. Its particularity is to request a job container capable of running the server and `n1` concurrent clients. One client job every `n2` submissions is submitted to the besteffort queue. Since the job container has its own inner job queue, job monitoring, cancellation and submission requests are made to the scheduler.
The container is parametrized with the following options:
"launcher_config": {
"scheduler": "oar-hybrid",
"container_max_number_of_clients": n1, #int
"besteffort_allocation_frequency": n2, # int
"scheduler_arg_container": "<scheduler-options-for-container>" # List[str]
}
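As an illustration of the role of `n2`, the besteffort decision boils down to a simple modulo rule (a conceptual sketch, not Melissa's actual code):

```python
def use_besteffort(submission_index: int, n2: int) -> bool:
    """Return True when this client submission should go to the besteffort queue.

    One client job out of every n2 submissions is flagged best effort;
    the others run inside the job container.
    """
    return submission_index % n2 == 0
```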
An example of configuration file for `oar-hybrid` is available for `heat-pde-sa`: `config_oar.json`.
This scheduling strategy was proposed within the framework of the REGALE project. Its main objective is to naturally adapt Melissa's job submission behavior to the cluster's level of occupancy by dynamically filling empty resources.
Slurm¶
The basic `slurm` scheduler is indirect. It submits `sbatch.<uid>.sh` scripts to Slurm with `sbatch` and handles job monitoring/cancellation via calls to `sacct` and `scancel`.
Examples of configuration files for Slurm are available for these use cases:
- `heat-pde-sa`: `config_slurm.json`,
- `heat-pde-dl`: `config_slurm.json`.
Note
A virtual cluster equipped with Slurm can be built locally by following the Virtual Cluster tutorial.
Slurm OpenMPI¶
The `slurm-openmpi` scheduler is indirect. It submits `sbatch.<uid>.sh` scripts to Slurm with `sbatch` and handles job monitoring/cancellation via calls to `sacct` and `scancel`.
By substituting the `srun` heterogeneous syntax with the `mpirun` MPMD syntax, this scheduler makes it possible to launch groups of size greater than one when `srun` heterogeneous job submission is not supported.
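For reference, the MPMD syntax chains several program blocks in a single `mpirun` call, separated by colons. A sketch of how such a command could be assembled (the executable name and task counts are placeholders):

```python
# Hypothetical example: a group of three clients, each running 4 MPI tasks.
members = [("./heatc", 4), ("./heatc", 4), ("./heatc", 4)]

cmd = ["mpirun"]
for i, (executable, ntasks) in enumerate(members):
    if i > 0:
        cmd.append(":")  # MPMD separator between program blocks
    cmd += ["-n", str(ntasks), executable]

# cmd is now equivalent to:
#   mpirun -n 4 ./heatc : -n 4 ./heatc : -n 4 ./heatc
```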
Warning
To ensure full flexibility, the total number of tasks is not derived automatically. Making sure `group_size`, `#SBATCH --ntasks` and `mpirun -n` are consistent is the user's responsibility.
An example of configuration file for `slurm-openmpi` is available for `heat-pde-sa`: `config_slurm.json`.
Slurm global¶
The `slurm-global` scheduler is direct. It first requests a heterogeneous pre-allocation for the server and the clients. Jobs are then submitted on their dedicated partition via `srun --het-group=<het-group-id>` based Python subprocesses. State monitoring and cancellation are performed in a direct way.
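A conceptual sketch of how such jobs could be spawned inside the pre-allocation is given below (executable names, task counts and the het-group mapping are placeholders, not Melissa's actual code):

```python
import subprocess

# Clients run in one het group, the server in the other, both as plain
# Python subprocesses inside the existing heterogeneous allocation.
client = subprocess.Popen(["srun", "--het-group=0", "--ntasks=4", "./heatc"])
server = subprocess.Popen(["srun", "--het-group=1", "--ntasks=2", "./server-command"])

# Monitoring and cancellation are then purely direct:
client_running = client.poll() is None
client.kill()
```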
Such a heterogeneous allocation is requested by submitting the corresponding batch script (e.g. `sbatch study_g.sh`) where, for a CPU & GPU allocation, `study_g.sh` would look as follows:
#!/bin/sh
#SBATCH --job-name=global
#SBATCH --time=HH:MM:SS
# CPU options (default het-group)
#SBATCH --nodes=X
#SBATCH --ntasks-per-node=XX
#SBATCH hetjob
# GPU options (het-group 1)
#SBATCH --nodes=Y
#SBATCH --ntasks-per-node=YY
#SBATCH --gres=gpu:YY
exec melissa-launcher --config_name /path/to/use-case/config_<scheduler>
The launcher configuration should include the following options:
"scheduler-arg-client": ["--ntasks=xx", "--het-group=0", "--time=HH:MM:SS"],
"scheduler-arg-server": ["--ntasks=yy", "--het-group=1", "--time=HH:MM:SS"]
`xx` and `yy` are fractions of `XX` and `YY`.
Note
In this configuration, the launcher does not run on the frontend node.
Warning
Some clusters do not support partition-based heterogeneous submissions. In this case, the `slurm-semiglobal` strategy should be preferred.
Slurm semi-global¶
The `slurm-semiglobal` scheduler is hybrid, which means that the server is treated indirectly while the clients are treated directly. It first requests a pre-allocation for the launcher and the clients. Jobs are then submitted to the batch scheduler for the server, or directly inside the allocation for the clients.
For this scheduling strategy, the launcher must be started as the main process of the client allocation (e.g. `sbatch study_sg.sh`) where, for a CPU allocation, `study_sg.sh` would look as follows:
would look as follows:
#!/bin/sh
#SBATCH --job-name=semi-global
#SBATCH --output=melissa-launcher.out
#SBATCH --time=HH:MM:SS
#SBATCH --nodes=X
#SBATCH --ntasks-per-node=XX
exec melissa-launcher --config_name /path/to/use-case/config_<scheduler>
The launcher configuration should include the following options:
"scheduler-arg-client": ["--ntasks=xx", "--time=HH:MM:SS", "--exclusive"],
"scheduler-arg-server": ["--ntasks=yy", "--time=HH:MM:SS"],
"job_limit": N,
"timer_delay": T
`job_limit` corresponds to the total number of client jobs the allocation can hold at the same time and `xx` is a fraction of `XX`.
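As a rough, hypothetical sizing example (actual limits depend on the cluster and on client options such as `--exclusive`, which may reserve whole nodes per client):

```python
# Illustrative numbers only: an allocation of X nodes with XX tasks per node,
# and client jobs of xx tasks each.
X, XX = 4, 32          # pre-allocation: 4 nodes, 32 tasks per node
xx = 8                 # tasks per client job

# Ignoring exclusivity constraints, the allocation can hold at most:
max_concurrent_clients = (X * XX) // xx    # -> 16
# job_limit should therefore not exceed this bound.
```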
An example of configuration file for `slurm-semiglobal` is available for `heat-pde-dl`: `config_slurm.json`.
Note
In this configuration, the server job state is monitored via life signal only (i.e. via PING messages).
OpenMPI¶
The basic `openmpi` scheduler is direct and was originally designed for local execution.
Examples of configuration files for OpenMPI are available for all use cases:
- `heat-pde-sa`: `config_mpi.json`,
- `heat-pde-dl`: `config_mpi.json`,
- `lorenz`: `config_mpi.json`.
With this scheduling strategy, optimal performance is guaranteed only if `job_limit` and `timer_delay` are properly set. Conversely, if the user wants to overload their processing elements, the `--oversubscribe` option can be used.
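As an illustration of when `--oversubscribe` becomes relevant, the sketch below (a conceptual example, not Melissa's actual logic) appends the option whenever the requested number of ranks exceeds the locally available cores:

```python
import os

# Hypothetical example: executable name and rank count are placeholders.
requested_ranks = 8
available_cores = os.cpu_count() or 1

cmd = ["mpirun", "-np", str(requested_ranks)]
if requested_ranks > available_cores:
    # Allow more MPI ranks than physical cores, at the cost of performance.
    cmd.append("--oversubscribe")
cmd.append("./lorenz")
print(" ".join(cmd))
```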