Slurm Docker Cluster

Inspired by the slurm-docker-cluster repository.

Slurm Docker Cluster is a multi-container Slurm cluster designed for rapid deployment using Docker Compose. It simplifies the process of setting up a robust Slurm environment for development, testing, or lightweight usage.

Getting Started

To get up and running with Slurm in Docker, make sure you have the following tools installed:

  • Docker
  • Docker Compose
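
You can confirm both are available with:

docker --version
docker compose version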

Containers and Volumes

This setup consists of the following containers:

  • mysql: Stores job and cluster data.
  • slurmdbd: Manages the Slurm database.
  • slurmctld: The Slurm controller responsible for job and resource management.
  • c1, c2, c3: Compute nodes (running slurmd).

Persistent Volumes:

  • etc_munge: Mounted to /etc/munge
  • etc_slurm: Mounted to /etc/slurm
  • slurm_jobdir: Mounted to /home/slurm
  • var_lib_mysql: Mounted to /var/lib/mysql
  • var_log_slurm: Mounted to /var/log/slurm
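
For orientation, the sketch below shows how one compute service and its mounts might be declared in docker-compose.yml. The service layout and mount list are illustrative; check the compose file shipped in .slurm for the authoritative definition:

services:
  c1:
    image: registry.gitlab.inria.fr/melissa/slurm-docker-cluster:${IMAGE_TAG}
    hostname: c1
    volumes:
      - etc_munge:/etc/munge
      - etc_slurm:/etc/slurm
      - slurm_jobdir:/home/slurm
      - var_log_slurm:/var/log/slurm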

Building the Base Docker Image

cd .slurm
docker build --network host --build-arg SLURM_TAG="slurm-21-08-6-1" -t registry.gitlab.inria.fr/melissa/slurm-docker-cluster:21.08.6 -f Dockerfile .

Building the Main Docker Image

cd <MELISSA_ROOT>
docker build -t registry.gitlab.inria.fr/melissa/melissa-slurm:latest -f Dockerfile.slurm .
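
Both builds can be verified afterwards by listing the resulting images:

docker images registry.gitlab.inria.fr/melissa/slurm-docker-cluster
docker images registry.gitlab.inria.fr/melissa/melissa-slurm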

Starting the Cluster

Once the images are built, deploy the cluster with the default version of Slurm using Docker Compose:

cd .slurm
docker compose up -d

To use a specific version and override the value configured in .env, set IMAGE_TAG explicitly:

cd .slurm
IMAGE_TAG=21.08.6 docker compose up -d
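
Docker Compose reads IMAGE_TAG from the .env file in the same directory. A minimal .env might contain nothing more than the line below, though the default shipped with the repository may differ:

IMAGE_TAG=21.08.6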

This will start up all containers in detached mode. You can monitor their status using:

docker compose ps

Register the Cluster

After the containers are up and running, register the cluster with SlurmDBD:

./register_cluster.sh
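
register_cluster.sh is a thin wrapper: in the upstream slurm-docker-cluster project it essentially runs sacctmgr inside the controller container, along the lines of the sketch below (the cluster name must match ClusterName in slurm.conf):

docker exec slurmctld bash -c "/usr/bin/sacctmgr --immediate add cluster name=linux"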

Warning

Wait a few seconds for the daemons to initialize before running the registration script to avoid connection errors like: sacctmgr: error: Problem talking to the database: Connection refused.
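
If the setup is scripted, a small retry loop avoids hand-timing that wait. A sketch, with the attempt count and delay left for you to tune:

for i in $(seq 1 10); do
    ./register_cluster.sh && break
    echo "slurmdbd not ready yet, retrying in 5s..."
    sleep 5
done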

For real-time cluster logs, use:

docker compose logs -f

Accessing the Cluster

To interact with the Slurm controller, open a shell inside the slurmctld container:

docker exec -it slurmctld bash

Now you can run any Slurm command from inside the container:

# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up 5-00:00:00      3   idle c[1-3]
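
Any standard Slurm client command works the same way, for example:

# squeue
# scontrol show node c1

squeue lists queued and running jobs; scontrol show node prints the detailed state of a single compute node.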

Submitting Jobs

The cluster mounts the slurm_jobdir volume across all nodes, making job files accessible from the /home/slurm directory. To submit a job:

# cd /home/slurm
# sbatch --wrap="hostname"
Submitted batch job 2

Check the output of the job:

# cat slurm-2.out
c1
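
Beyond --wrap one-liners, you can also submit a regular batch script. A minimal sketch, where the filename and directives are illustrative:

# cat job.sh
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --output=slurm-%j.out
srun hostname

# sbatch job.sh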

Cluster Management

Stopping and Restarting:

Stop the cluster without removing the containers:

docker compose stop

Restart it later:

docker compose start

Deleting the Cluster:

To completely remove the containers and associated volumes:

docker compose down -v
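
To remove the containers while preserving the volumes, so that configuration, accounting data, and job files survive the next docker compose up, omit the -v flag:

docker compose down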

Advanced Configuration

You can modify the Slurm configuration files (slurm.conf, slurmdbd.conf) on the fly, without rebuilding the images. Just run:

./update_slurmfiles.sh slurm.conf slurmdbd.conf
docker compose restart

This makes it easy to add/remove nodes or test new configuration settings dynamically.
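
For example, growing the cluster to a fourth node would mean declaring it in slurm.conf and adding a matching c4 service to docker-compose.yml. An illustrative slurm.conf excerpt; match the CPU count to your containers and keep the partition definition in sync:

NodeName=c[1-4] CPUs=1 State=UNKNOWN
PartitionName=normal Default=yes Nodes=c[1-4] MaxTime=5-00:00:00 State=UP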