# Slurm Docker Cluster
Inspired by the slurm-docker-cluster repository.
Slurm Docker Cluster is a multi-container Slurm cluster designed for rapid deployment using Docker Compose. It simplifies the process of setting up a robust Slurm environment for development, testing, or lightweight usage.
## Getting Started
To get up and running with Slurm in Docker, make sure you have the following tools installed:

- Docker
- Docker Compose
## Containers and Volumes
This setup consists of the following containers:
- `mysql`: Stores job and cluster data.
- `slurmdbd`: Manages the Slurm database.
- `slurmctld`: The Slurm controller responsible for job and resource management.
- `c1`, `c2`, `c3`: Compute nodes (running `slurmd`).
### Persistent Volumes
- `etc_munge`: Mounted to `/etc/munge`
- `etc_slurm`: Mounted to `/etc/slurm`
- `slurm_jobdir`: Mounted to `/home/slurm`
- `var_lib_mysql`: Mounted to `/var/lib/mysql`
- `var_log_slurm`: Mounted to `/var/log/slurm`
## Building the base Docker image
```bash
cd .slurm
docker build --network host --build-arg SLURM_TAG="slurm-21-08-6-1" -t registry.gitlab.inria.fr/melissa/slurm-docker-cluster:21.08.6 -f Dockerfile .
```
## Building the main Docker image
```bash
cd <MELISSA_ROOT>
docker build -t registry.gitlab.inria.fr/melissa/melissa-slurm:latest -f Dockerfile.slurm .
```
## Starting the Cluster
Once the image is built, deploy the cluster with the default Slurm version using Docker Compose:
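For example, from the directory containing the compose file (path assumed):

```bash
# Start all services in the background; use `docker-compose` instead if you
# only have the standalone v1 binary installed.
docker compose up -d
```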
To use a specific version and override what is configured in `.env`, set `IMAGE_TAG`:
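For instance, assuming the compose file reads `IMAGE_TAG` from the environment:

```bash
IMAGE_TAG=21.08.6 docker compose up -d
```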
This will start up all containers in detached mode. You can monitor their status using:
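For example, to list the containers and their current state:

```bash
docker compose ps
```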
## Register the Cluster
After the containers are up and running, register the cluster with SlurmDBD:
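A sketch of the registration step; the helper script name follows the upstream slurm-docker-cluster project, and the `linux` cluster name is the assumed `ClusterName` from `slurm.conf`:

```bash
# Assumed helper script (from the upstream slurm-docker-cluster project):
./register_cluster.sh

# Equivalent manual registration from the controller container; the cluster
# name must match the ClusterName set in slurm.conf (assumed here to be "linux"):
docker exec slurmctld sacctmgr --immediate add cluster name=linux
```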
> **Warning:** Wait a few seconds for the daemons to initialize before running the registration script, otherwise you may hit connection errors such as `sacctmgr: error: Problem talking to the database: Connection refused`.
For real-time cluster logs, use:
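For example:

```bash
docker compose logs -f            # follow logs from all services
docker compose logs -f slurmctld  # only the controller
```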
## Accessing the Cluster
To interact with the Slurm controller, open a shell inside the `slurmctld` container:
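For example:

```bash
docker exec -it slurmctld bash
```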
Now you can run any Slurm command from inside the container:
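For instance:

```bash
sinfo    # show partitions and node state
squeue   # list queued and running jobs
```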
## Submitting Jobs
The cluster mounts the `slurm_jobdir` volume across all nodes, making job files accessible from the `/home/slurm` directory. To submit a job:
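A minimal sketch, assuming you are inside the `slurmctld` container as shown above; `--wrap` turns the quoted command into a one-line batch script:

```bash
cd /home/slurm
sbatch --wrap="hostname"
```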
Check the output of the job:
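Assuming the default output naming scheme (`slurm-<jobid>.out`):

```bash
ls /home/slurm
cat /home/slurm/slurm-<jobid>.out   # replace <jobid> with the ID printed by sbatch
```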
## Cluster Management

### Stopping and Restarting
Stop the cluster without removing the containers:
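For example:

```bash
docker compose stop
```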
Restart it later:
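Likewise:

```bash
docker compose start
```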
### Deleting the Cluster
To completely remove the containers and associated volumes:
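For example (`-v` also removes the named volumes listed above):

```bash
docker compose down -v
```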
## Advanced Configuration
You can modify the Slurm configuration files (`slurm.conf`, `slurmdbd.conf`) on the fly without rebuilding the containers. Just run:
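One possible workflow, assuming you edit the files on the mounted `etc_slurm` volume from inside the controller container and then have the daemons pick up the change:

```bash
# Edit the shared configuration (any editor available in the image will do):
docker exec -it slurmctld vi /etc/slurm/slurm.conf

# Ask the running daemons to re-read the configuration...
docker exec slurmctld scontrol reconfigure

# ...or restart the affected services instead:
docker compose restart slurmctld slurmdbd c1 c2 c3
```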
This makes it easy to add/remove nodes or test new configuration settings dynamically.