# Slurm Docker Cluster
Inspired by the slurm-docker-cluster repository.
Slurm Docker Cluster is a multi-container Slurm cluster designed for rapid deployment using Docker Compose. It simplifies the process of setting up a robust Slurm environment for development, testing, or lightweight usage.
## Getting Started
To get up and running with Slurm in Docker, make sure you have Docker and Docker Compose installed.
## Containers and Volumes
This setup consists of the following containers:
- mysql: Stores job and cluster data.
- slurmdbd: Manages the Slurm database.
- slurmctld: The Slurm controller responsible for job and resource management.
- c1, c2, c3: Compute nodes (running slurmd).
### Persistent Volumes

- etc_munge: mounted to /etc/munge
- etc_slurm: mounted to /etc/slurm
- slurm_jobdir: mounted to /home/slurm
- var_lib_mysql: mounted to /var/lib/mysql
- var_log_slurm: mounted to /var/log/slurm
## Building the base Docker image

```shell
cd .slurm
docker build --network host --build-arg SLURM_TAG="slurm-21-08-6-1" -t registry.gitlab.inria.fr/melissa/slurm-docker-cluster:21.08.6 -f Dockerfile .
```
## Building the main Docker image

```shell
cd <MELISSA_ROOT>
docker build -t registry.gitlab.inria.fr/melissa/melissa-slurm:latest -f Dockerfile.slurm .
```
## Starting the Cluster

Once the images are built, deploy the cluster with the default Slurm version using Docker Compose:
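A minimal sketch, assuming the Compose file sits at the repository root (use `docker-compose` on older installations without the Compose plugin):

```shell
docker compose up -d
```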
To use a specific version and override the value configured in .env, set IMAGE_TAG:
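For example, assuming the Compose file reads the `IMAGE_TAG` environment variable (the tag value below is illustrative):

```shell
IMAGE_TAG=21.08.6 docker compose up -d
```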
This will start up all containers in detached mode. You can monitor their status using:
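For example:

```shell
docker compose ps
```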
## Register the Cluster
After the containers are up and running, register the cluster with SlurmDBD:
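A sketch, assuming the `register_cluster.sh` helper script from the upstream slurm-docker-cluster project is present at the repository root; it wraps `sacctmgr` to add the cluster to the accounting database:

```shell
./register_cluster.sh
```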
> **Warning:** Wait a few seconds for the daemons to initialize before running the registration script, to avoid connection errors like `sacctmgr: error: Problem talking to the database: Connection refused`.
For real-time cluster logs, use:
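One way to follow the logs of all containers (add a service name such as `slurmctld` to narrow the output):

```shell
docker compose logs -f
```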
## Accessing the Cluster
To interact with the Slurm controller, open a shell inside the slurmctld container:
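Assuming the container is named `slurmctld`, as in the Compose service list above:

```shell
docker exec -it slurmctld bash
```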
Now you can run any Slurm command from inside the container:
```
[login_slurmctld /]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up 5-00:00:00      2   idle c[1-2]
```
## Submitting Jobs
The cluster mounts the slurm_jobdir volume across all nodes, making job files accessible from the /home/slurm directory. To submit a job:
```
[login_slurmctld /]# cd /home/slurm
[login_slurmctld data]# sbatch --wrap="hostname"
Submitted batch job 2
```
Check the output of the job:
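By default, `sbatch` writes job output to `slurm-<jobid>.out` in the submission directory, so for job 2 above:

```shell
cat slurm-2.out
```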
## Cluster Management

### Stopping and Restarting
Stop the cluster without removing the containers:
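For example:

```shell
docker compose stop
```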
Restart it later:
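For example:

```shell
docker compose start
```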
### Deleting the Cluster
To completely remove the containers and associated volumes:
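The `-v` flag removes the named volumes listed above along with the containers:

```shell
docker compose down -v
```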
## Advanced Configuration
You can modify Slurm configurations (`slurm.conf`, `slurmdbd.conf`) on the fly without rebuilding the containers. Just run:
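A sketch of one way to do this, assuming you edit the config files locally and push them into the shared `/etc/slurm` volume via the controller (the upstream slurm-docker-cluster project ships an `update_slurmfiles.sh` helper for the same purpose); the file paths and service names below are assumptions:

```shell
# Copy the edited configs into the shared /etc/slurm volume via the controller container
docker cp slurm.conf slurmctld:/etc/slurm/slurm.conf
docker cp slurmdbd.conf slurmctld:/etc/slurm/slurmdbd.conf
# Restart the daemons so they pick up the new configuration
docker compose restart slurmctld slurmdbd c1 c2 c3
```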
This makes it easy to add/remove nodes or test new configuration settings dynamically.