
Slurm Docker Cluster

Inspired by the slurm-docker-cluster repository.

Slurm Docker Cluster is a multi-container Slurm cluster designed for rapid deployment using Docker Compose. It simplifies the process of setting up a robust Slurm environment for development, testing, or lightweight usage.

Getting Started

To get up and running with Slurm in Docker, make sure you have Docker and Docker Compose installed.

Containers and Volumes

This setup consists of the following containers:

  • mysql: Stores job and cluster data.
  • slurmdbd: Manages the Slurm database.
  • slurmctld: The Slurm controller responsible for job and resource management.
  • c1, c2, c3: Compute nodes (running slurmd).

Persistent Volumes:

  • etc_munge: Mounted to /etc/munge
  • etc_slurm: Mounted to /etc/slurm
  • slurm_jobdir: Mounted to /home/slurm
  • var_lib_mysql: Mounted to /var/lib/mysql
  • var_log_slurm: Mounted to /var/log/slurm

Building the Base Docker Image

# from melissa/
docker build --build-arg SLURM_TAG="slurm-25-11-2-1" \
    --tag slurm-docker-cluster:25.11.2 \
    -f .slurm/Dockerfile .slurm/

Building the Main Docker Image

The following image uses slurm-docker-cluster as its base; all Melissa dependencies are installed on top of it.

# from melissa/
docker build --build-arg PYPROJECT_INSTALL_BRANCH=develop \
    --tag melissa-slurm:latest \
    -f Dockerfile.slurm .

Note

Set PYPROJECT_INSTALL_BRANCH to the Melissa branch whose pyproject.toml dependencies should be installed. Make sure this branch is available on the remote.

Starting the Cluster

Once the image is built, deploy the cluster with the default Slurm version using Docker Compose:

cd .slurm
docker compose up -d

This will start up all containers in detached mode. You can monitor their status using:

docker compose ps

Register the Cluster

After the containers are up and running, register the cluster with SlurmDBD:

./register_cluster.sh

Delayed Starts

Wait a few seconds for the daemons to initialize before running the registration script to avoid connection errors like: sacctmgr: error: Problem talking to the database: Connection refused.
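If you script the startup, this wait can be automated with a small retry helper. The sketch below is illustrative, not part of the repository: `wait_until` is a hypothetical function, and the probe in the trailing comment assumes the slurmctld container name from this compose file.

```shell
#!/bin/bash
# Hypothetical helper: retry a command every 2 seconds until it
# succeeds, giving up after a fixed number of attempts.
wait_until() {
    local retries=$1; shift
    local i=0
    until "$@"; do
        i=$((i + 1))
        if [ "$i" -ge "$retries" ]; then
            return 1
        fi
        sleep 2
    done
}

# Example: wait until slurmdbd accepts connections, then register.
# wait_until 15 docker exec slurmctld sacctmgr --immediate show cluster \
#     && ./register_cluster.sh
```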

For real-time cluster logs, use:

docker compose logs -f

Accessing the Cluster

To interact with the Slurm controller, open a shell inside the slurmctld container:

docker exec -it slurmctld bash

Now you can run any Slurm command from inside the container:

[login_slurmctld /]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up 5-00:00:00      2   idle c[1-2]

Submitting Jobs

The cluster mounts the slurm_jobdir volume across all nodes, making job files accessible from the /home/slurm directory. To submit a job:

[login_slurmctld /]# cd /home/slurm
[login_slurmctld slurm]# sbatch --wrap="hostname"
Submitted batch job 2

Check the output of the job:

[login_slurmctld slurm]# cat slurm-2.out
c1
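Beyond `--wrap` one-liners, jobs are usually submitted as batch scripts. A hypothetical `hello.sbatch` placed in /home/slurm might look like the fragment below; the partition name matches the sinfo output above, everything else is illustrative.

```shell
#!/bin/bash
# Hypothetical batch script (hello.sbatch); submit from /home/slurm
# with: sbatch hello.sbatch
#SBATCH --job-name=hello
#SBATCH --partition=normal
#SBATCH --output=%x-%j.out

srun hostname
```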

Cluster Management

Stopping and Restarting:

Stop the cluster without removing the containers:

docker compose stop

Restart it later:

docker compose start

Deleting the Cluster:

To completely remove the containers and associated volumes:

docker compose down -v

Advanced Configuration

You can modify Slurm configurations (slurm.conf, slurmdbd.conf, cgroup.conf) on the fly without rebuilding the containers. Just run:

./update_slurmfiles.sh slurm.conf slurmdbd.conf cgroup.conf
docker compose restart

This makes it easy to add/remove nodes or test new configuration settings dynamically.
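As an illustration, adding a fourth compute node could look like this hypothetical slurm.conf fragment; the CPU count is a placeholder, and the partition line mirrors the sinfo output shown earlier. A matching c4 service would also need to be added to docker-compose.yml.

```
# Hypothetical slurm.conf fragment: declare a new compute node c4
# and include it in the default partition.
NodeName=c4 CPUs=1 State=UNKNOWN
PartitionName=normal Default=yes Nodes=c[1-4] MaxTime=5-00:00:00 State=UP
```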