Slurm Docker Cluster¶
Inspired by the slurm-docker-cluster repository.
Slurm Docker Cluster is a multi-container Slurm cluster designed for rapid deployment using Docker Compose. It simplifies the process of setting up a robust Slurm environment for development, testing, or lightweight usage.
Getting Started¶
To get up and running with Slurm in Docker, make sure you have the following tools installed: Docker and Docker Compose.
Containers and Volumes¶
This setup consists of the following containers:
- mysql: Stores job and cluster data.
- slurmdbd: Manages the Slurm database.
- slurmctld: The Slurm controller responsible for job and resource management.
- c1, c2, c3: Compute nodes (running slurmd).
Persistent Volumes:¶
- etc_munge: Mounted to /etc/munge
- etc_slurm: Mounted to /etc/slurm
- slurm_jobdir: Mounted to /home/slurm
- var_lib_mysql: Mounted to /var/lib/mysql
- var_log_slurm: Mounted to /var/log/slurm
Building the base Docker image¶
# from melissa/
docker build --build-arg SLURM_TAG="slurm-25-11-2-1" \
--tag slurm-docker-cluster:25.11.2 \
-f .slurm/Dockerfile .slurm/
Building the main Docker image¶
The following image will now use slurm-docker-cluster as its base. This is where all Melissa dependencies will be installed.
# from melissa/
docker build --build-arg PYPROJECT_INSTALL_BRANCH=develop \
--tag melissa-slurm:latest \
-f Dockerfile.slurm .
Note
Set PYPROJECT_INSTALL_BRANCH to the Melissa branch whose dependency changes in pyproject.toml should be installed.
Ensure this branch is available on the remote.
Starting the Cluster¶
Once the image is built, deploy the cluster with the default version of Slurm using Docker Compose:
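The exact command is not preserved on this page; with a standard Compose setup (a docker-compose.yml at the repository root is assumed) it would typically be:

```shell
# Start all services (mysql, slurmdbd, slurmctld, c1-c3) in the background
docker compose up -d
```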
This will start up all containers in detached mode. You can monitor their status using:
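A typical status check with Docker Compose v2 (assumed here) looks like:

```shell
# List the cluster's containers and their current state
docker compose ps
```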
Register the Cluster¶
After the containers are up and running, register the cluster with SlurmDBD:
Delayed Starts
Wait a few seconds for the daemons to initialize before running the registration script to avoid connection errors like: sacctmgr: error: Problem talking to the database: Connection refused.
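The registration command itself is not shown on this page. In the upstream slurm-docker-cluster repository it boils down to running sacctmgr inside the controller container; a sketch, assuming the cluster is registered under the name used by your slurm.conf:

```shell
# Register the cluster with SlurmDBD (the cluster name "linux" is the
# upstream default and is an assumption here -- match your slurm.conf)
docker exec slurmctld bash -c "sacctmgr --immediate add cluster name=linux"
```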
For real-time cluster logs, use:
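Presumably the standard Compose log-follow command:

```shell
# Stream logs from all containers; add a service name to narrow the output
docker compose logs -f
```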
Accessing the Cluster¶
To interact with the Slurm controller, open a shell inside the slurmctld container:
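Assuming the container is named slurmctld as in the Compose setup described above:

```shell
# Open an interactive shell in the Slurm controller container
docker exec -it slurmctld bash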
Now you can run any Slurm command from inside the container:
[login_slurmctld /]# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
normal* up 5-00:00:00 2 idle c[1-2]
Submitting Jobs¶
The cluster mounts the slurm_jobdir volume across all nodes, making job files accessible from the /home/slurm directory. To submit a job:
[login_slurmctld /]# cd /home/slurm
[login_slurmctld data]# sbatch --wrap="hostname"
Submitted batch job 2
Check the output of the job:
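By default, sbatch writes job output to slurm-<jobid>.out in the submission directory, so for job 2 from the example above:

```shell
# Inside the slurmctld container, from /home/slurm
cat slurm-2.out
```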
Cluster Management¶
Stopping and Restarting:¶
Stop the cluster without removing the containers:
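With Docker Compose v2 (assumed), stopping without removal is:

```shell
# Stop all containers; volumes and container state are preserved
docker compose stop
```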
Restart it later:
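The matching restart, under the same Compose assumption:

```shell
# Restart the previously stopped containers
docker compose start
```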
Deleting the Cluster:¶
To completely remove the containers and associated volumes:
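The usual Compose teardown, assuming you also want to drop the named volumes listed above (etc_munge, etc_slurm, slurm_jobdir, var_lib_mysql, var_log_slurm):

```shell
# Remove containers, networks, AND the associated volumes (-v)
docker compose down -v
```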
Advanced Configuration¶
You can modify Slurm configurations (slurm.conf, slurmdbd.conf, cgroup.conf) on the fly without rebuilding the containers. Just run:
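The exact command is not preserved here. The upstream slurm-docker-cluster repository ships an update_slurmfiles.sh helper that copies edited configs into the shared etc_slurm volume; a sketch under that assumption:

```shell
# Push the edited configuration files into the running cluster
# (update_slurmfiles.sh is the upstream helper script; adjust if absent)
./update_slurmfiles.sh slurm.conf slurmdbd.conf

# Restart the daemons so they pick up the new configuration
docker compose restart
```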
This makes it easy to add/remove nodes or test new configuration settings dynamically.