GitLab CI Documentation

This document outlines the CI/CD configuration, detailing various components, workflows, and stages in the pipeline.

The GitLab pipeline is designed to be triggered manually through the web interface when the CI_PIPELINE_SOURCE is set to "web", allowing developers to initiate the pipeline for specific testing or deployment needs. Additionally, the pipeline leverages Docker-in-Docker (DinD) version 24.0.5 to create an isolated Docker environment, enabling seamless building and management of containers during the CI jobs.
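As a sketch, the trigger condition and DinD service described above might appear in .gitlab-ci.yml along these lines (this is an illustrative fragment, not the exact configuration):

```yaml
# Illustrative sketch of the web-only trigger and the DinD service described above.
workflow:
  rules:
    - if: '$CI_PIPELINE_SOURCE == "web"'   # run only when started from the web UI

services:
  - docker:24.0.5-dind                     # isolated Docker daemon for the CI jobs
```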

Unit Tests

  • The launcher was designed as a modular piece of code, independent of the batch scheduler at hand. As a result, the higher-level structural parts (e.g. the I/O master and the state machine) are thoroughly testable.

  • Similarly, the server was entirely revisited and redesigned so that the central objects (e.g. BaseServer, Simulation, FaultTolerance, etc.) could be instantiated and tested separately.

Important Notes

  • Stages run on Inria's shared runners (ci.inria.fr, large). Docker images are not cached on these runners, so each stage pulls its image anew, which makes execution slower than on local runners.
  • Multi-rank clients or server setups can cause undefined behavior on shared runners.

Update

The final SLURM global stage runs two ranks per client and server, with a reduced job limit to accommodate multi-rank runs. This way, we cover almost every real-world situation, but there is always room for improvement!

Artifact Configuration

Artifacts generated during the pipeline are configured to be retained for one week. The following paths are collected as artifacts:

  • Environment setup script: melissa_set_env.sh
  • Build directories: $CI_PROJECT_DIR/build/ and $CI_PROJECT_DIR/install/
  • Heat PDE executables: examples/heat-pde/executables/build/
  • Sensitivity Analysis Results: $SA_RESULTS_DIR
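Assuming a standard GitLab artifacts block, the retention and paths above could be expressed roughly as follows (the `when: always` setting is an assumption to keep artifacts downloadable on failure):

```yaml
# Hypothetical sketch of the artifact configuration described above.
artifacts:
  expire_in: 1 week
  when: always            # assumption: keep artifacts downloadable on failure too
  paths:
    - melissa_set_env.sh
    - $CI_PROJECT_DIR/build/
    - $CI_PROJECT_DIR/install/
    - examples/heat-pde/executables/build/
    - $SA_RESULTS_DIR
```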

Note

Each stage's artifacts are configured to be downloadable from the UI on failure. This might not work for the SLURM stages that run with DinD (Docker-in-Docker).

Predefined Variables

These variables avoid repetition and make the stage submissions more generic. The following variables are globally defined but can be overridden in specific stages:

  • DOCKER_DRIVER: Set to overlay for Docker storage.
  • SA_RESULTS_URL: URL to download sa_sobol_results.npz.
  • SA_RESULTS_DIR: Directory for storing SA results.
  • BASIC_IMAGE: Base Docker image without CUDA, pulled from GitLab Registry.
  • SLURM_IMAGE: SLURM image for workload management.
  • SLURM_CONTENTS: Directory for SLURM configuration files.
  • MELISSA_ROOT: Project root directory.
  • RESULTS_DIR: Results directory for study outputs, dependent on EXAMPLE_DIR.
  • HEATPDE_EXEC_DIR: Directory for heat PDE executable files.
  • VIRTUAL_ENV: Virtual environment path for Python dependencies.
  • CI_SCRIPTS: Path to CI-related scripts.
  • CI_CONFIGS: Path to CI configuration files.
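A global variables block along these lines would match the description above; apart from DOCKER_DRIVER, the values shown are placeholders, not the real ones:

```yaml
# Illustrative global variables block; all values except DOCKER_DRIVER are placeholders.
variables:
  DOCKER_DRIVER: overlay
  MELISSA_ROOT: $CI_PROJECT_DIR            # assumption: project root is the checkout dir
  CI_SCRIPTS: $MELISSA_ROOT/ci/scripts
  CI_CONFIGS: $MELISSA_ROOT/ci/configs
  SA_RESULTS_DIR: $CI_PROJECT_DIR/sa_results   # hypothetical location
```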

Result reproducibility for SA studies

You may be wondering what the SA_* variables refer to. The reference results are contained in a NumPy compressed file downloaded from the cloud; it includes the Moments and Sobol sensitivity analysis results produced by the configuration in examples/heat-pde/heat-pde-sa/config_mpi.json. After updating several parts of the code, it is crucial to verify that the changes across the melissa module do not affect the produced results.

Once an SA study completes, we compare the new results with the downloaded ones by running:

python3 tests/l2norm.py $SA_RESULTS_DIR STUDY_OUT/results
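In a CI job, this comparison might appear as a final script step after the study, for example (sketch; the exact script invoked depends on the stage):

```yaml
# Sketch: run the SA study, then compare against the downloaded reference results.
script:
  - $CI_SCRIPTS/sa_study.sh                                     # run the SA study
  - python3 tests/l2norm.py $SA_RESULTS_DIR STUDY_OUT/results   # compare to reference
```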

File organization

Each stage consists of two main tasks:

  1. Setting environment variables like EXAMPLE_DIR, CONFIG_FILE, etc.
  2. Submitting a relevant script from ci/scripts that depends on these variables.

Since most Melissa jobs follow a similar command structure, we use Bash scripts to streamline study submissions.
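Putting the two tasks together, a stage might look roughly like this (the variable values shown are illustrative assumptions):

```yaml
# Hypothetical shape of a study stage: set variables, then submit a ci/scripts script.
integration_test_openmpi:heat_pde:
  variables:
    EXAMPLE_DIR: examples/heat-pde/heat-pde-sa   # illustrative value
    CONFIG_FILE: config_mpi.json                 # illustrative value
  script:
    - $CI_SCRIPTS/sa_study.sh
```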

├── ci
    ├── configs
       ├── lorenz_fail.py
       ├── openmpi_faulttol_dl_client.json
       ├── openmpi_faulttol_dl.json
       ├── openmpi_faulttol_sa.json
       ├── study_g.sh
       ├── study_sg.sh
       ├── vc_fail_dl_server.py
       ├── vc_fail_sa_server.py
       ├── vc_slurm_dl.json
       ├── vc_slurm_global.json
       ├── vc_slurm_openmpi_dl.json
       ├── vc_slurm_openmpi_sa.json
       ├── vc_slurm_sa.json
       └── vc_slurm_semiglobal.json
    └── scripts
        ├── build.sh
        ├── dl_study.sh
        ├── init_env.sh
        ├── sa_study.sh
        ├── slurm_ft_dl_global.sh
        ├── slurm_ft_dl_pure.sh
        ├── slurm_ft_dl_semig.sh
        ├── slurm_ft_sa_pure.sh
        └── slurm_ompi.sh
├── launcher
   ├── test_io.py
   ├── test_message.py
   └── test_state_machine.py
├── scheduler
   ├── test_dummy.py
   ├── test_openmpi.py
   ├── test_scheduler.py
   └── test_slurm.py
├── server
   ├── simple_sa_server.py
   ├── test_dataset.py
   ├── test_reservoir.py
   ├── test_sensitivity_analysis_server.py
   └── test_server.py
└── utility
    ├── test_functools.py
    ├── test_networking.py
    └── test_timer.py

CI Stages

Basic (OpenMPI) stages

These stages mostly focus on code checks, Python unit tests, deployment, and OpenMPI studies.

  • docker_build: Builds and pushes the Docker image. It uses the dind service for container operations. By default, it runs only on the CI_DEFAULT_BRANCH and on merge request events when specific files have changed.
  • build_melissa: Builds the Melissa application.
  • code_check:static:flake8_mypy: Performs static code analysis using flake8 and mypy to check for style violations and type annotations, respectively.
  • unit_test:dynamic: Runs dynamic tests using pytest and collects code coverage information.
  • pages: Deploys the project documentation using mkdocs and includes the coverage report in the generated output. Triggered only for commits to the develop branch.
  • integration_test_openmpi:heat_pde: Runs integration tests for the server-side components of the heat-pde examples (SA and DL) using OpenMPI.
  • fault_tolerance:heat_pde:server_side: Tests the fault tolerance of the server-side components for the heat-pde examples (SA and DL) using OpenMPI.
  • fault_tolerance:py_api_lorenz:client_side: Tests the fault tolerance of the client-side components for the Lorenz system. Depends on both the build_melissa and fault_tolerance:heat_pde:server_side stages.
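The docker_build trigger behavior described above can be sketched with GitLab rules as follows (the watched file list is an assumption):

```yaml
# Sketch of the docker_build trigger rules; the changes list is illustrative.
docker_build:
  services:
    - docker:24.0.5-dind
  rules:
    - if: '$CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH'
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
      changes:
        - Dockerfile        # hypothetical: the "specific files" that trigger a rebuild
```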

SLURM CI Stages

The following stages run as SLURM pipeline jobs:

  • build_docker_slurm: Builds the SLURM Docker cluster image. Similar behavior to the docker_build stage.
  • build_melissa_slurm: Builds the Melissa SLURM environment. This stage extends the SLURM image initialization and prepares the environment to run tests. It initializes the SLURM environment and runs the build script, followed by a test script to validate the setup.
  • integration_test_slurm_openmpi: Runs integration tests for SLURM with OpenMPI support. This stage executes the integration test scripts in the SLURM environment using OpenMPI, and cancels the job if the test fails.
  • fault_tolerance:slurm:server_side: Tests fault tolerance for the SLURM server-side configuration. It runs a series of fault tolerance tests (SA, DL, Semi-Global, and Global) inside the cluster, and cancels the job if any test fails.
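The "cancel the job if a test fails" behavior might be handled in the stage script roughly as follows (a sketch; the exact cancellation command is an assumption):

```yaml
# Illustrative failure handling: cancel pending SLURM jobs if a test script fails.
script:
  - docker exec -t slurmctld bash -c '$CI_SCRIPTS/slurm_ft_sa_pure.sh' || { docker exec -t slurmctld bash -c 'scancel -u $(whoami)'; exit 1; }
```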

Using SLURM cluster within Gitlab CI

The following is an overview of the procedure for utilizing SLURM with Docker in a CI/CD pipeline, detailing the steps involved in setting up the SLURM cluster and running jobs. For more details, check Creating a SLURM docker cluster.

1. Docker-in-Docker (DinD) Service for SLURM Cluster Build

In the pipeline, we use the Docker-in-Docker (DinD) service to build and manage the SLURM cluster. This allows us to spin up a containerized environment for SLURM without running the cluster directly on the CI runner. The Docker service helps create the necessary infrastructure for SLURM to function in an isolated environment while providing flexibility and reproducibility across different environments.

2. Mounting $MELISSA_ROOT with Docker-Compose

To ensure that the SLURM setup has access to the necessary files from the current environment, we mount the $MELISSA_ROOT directory into the container using the docker-compose.yml configuration. This step is crucial because it allows the SLURM containers to interact with the code and configuration files located on the host machine. Since the pipeline does not run directly inside a container, this mount enables seamless access to the project files for all stages.
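A docker-compose fragment for this mount might look as follows; mounting the project at the same path inside and outside the container is an assumption that keeps $MELISSA_ROOT valid in both contexts:

```yaml
# Illustrative docker-compose.yml fragment: mount the project into the head node.
services:
  slurmctld:
    volumes:
      - ${MELISSA_ROOT}:${MELISSA_ROOT}   # assumption: identical host/container path
```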

3. Image Pull and Cluster Initialization

Each SLURM-related stage in the pipeline will pull the required Docker image to run the SLURM cluster. In the setup:

  • The slurmctld container acts as the head node (the main controller of the SLURM cluster).
  • The c1, c2, c3, ... containers represent the compute nodes where jobs are executed.

The cluster is launched using Docker Compose, which orchestrates the creation and management of these containers.

4. Register and Update the SLURM Cluster

After the SLURM cluster is launched, the next step is to register the cluster and ensure that its configuration is up-to-date:

  • Cluster Registration: This is done by executing the .slurm/register_cluster.sh script within the slurmctld container. This step ensures that the SLURM controller is properly registered and initialized.
  • Configuration Update: The configuration files slurm.conf and slurmdbd.conf are updated by running the ./update_slurmfiles.sh script. This ensures that the SLURM controller and database configuration files are correctly set up for the environment.

5. Running SLURM Jobs

Once the SLURM cluster is set up and registered, SLURM jobs are submitted and executed under the slurmctld container. These jobs are typically bash scripts that run on the compute nodes managed by the SLURM controller. The jobs can be triggered based on the requirements of the pipeline, such as running simulations or tests.


Important Considerations

1. Environment Variable Management

When working within the SLURM containers, it's important to manage environment variables carefully. Set or override any necessary environment variables in the .slurm/.env file. These variables will be used within the containers to configure the environment as needed for the SLURM jobs.

2. Absolute Paths for Scripts

When executing scripts within the SLURM containers, always specify script paths as absolute paths. This avoids issues with relative paths being interpreted differently depending on the current working directory inside the container. For example, a preferred way to launch a job inside the cluster is:

docker exec -t slurmctld bash -c '$MELISSA_ROOT/script.sh'

Warning

When submitting a script (as above), we do not want the current shell to expand the variables in the execution command, which it does by default. Therefore, it is important to use single quotes so that the variables expand only inside the container.

3. Running the SLURM cluster locally for testing SLURM studies

You may want to test any changes locally first. It is simple! Execute the commands below to run a SLURM study:

cd .slurm
docker compose up -d && sleep 10
./register_cluster.sh
./update_slurmfiles.sh  slurm.conf slurmdbd.conf
docker exec -t slurmctld bash -c '$CI_SCRIPTS/build.sh'
docker exec -t slurmctld bash -c '$CI_SCRIPTS/slurm_ft_dl_global.sh'

Warning

  • build.sh will create build/ and install/ folders, as the container mounts MELISSA/. Therefore, always remove them through the Docker container. For example, execute docker exec -t slurmctld bash -c 'rm -rf $MELISSA_ROOT/build' to delete build/.
  • Do the same for any STUDY_OUT/ folders generated through container executions.

Once the study is over, run:

docker compose down -v

Automatic Versioning (Optional)

After all pipeline stages complete successfully, an optional versioning stage can be executed to increment Melissa's version. The bump_version stage is designed to handle this using Python's bump2version module.

The current version is stored in the following files:

  • $MELISSA_ROOT/VERSION
  • $MELISSA_ROOT/melissa/version.py

These files are specified in the setup.cfg configuration. When the version is bumped, bump2version automatically updates these files.

To use the versioning stage, the following pre-configuration steps take place:

  1. Load the Python Docker Container: Ensure the Python Docker container with python3 and git is properly loaded.

  2. Log In to Git: Authenticate with Git using the current developer's credentials. The account must have Maintainer or Developer privileges.

  3. Set the git push URL: Configure the git push URL to Melissa's remote repository using the CI_ACCESS_TOKEN environment variable.

    • If the current CI_ACCESS_TOKEN has expired, create a new token:
      • Navigate to Settings → Access Tokens.
      • Generate a new token with the write_repository scope.
    • After generating the token, copy it and update the CI_ACCESS_TOKEN environment variable:
      • Go to Settings → CI/CD → Variables and replace the old token with the new one.

The versioning stage is executed manually by the developer. Follow these steps to run the stage:

  1. Set the VERSIONING_TYPE Environment Variable: Before running the stage, create the VERSIONING_TYPE environment variable. It determines how the version is incremented and can be one of patch, minor, or major. The version is automatically updated based on the current value and the specified VERSIONING_TYPE.

  2. Optional: Specify a New Version: Developers can manually specify a version by setting the NEW_VERSION environment variable. This overrides the automatic increment.
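The steps above could be assembled into a manual job roughly like this (image, remote URL, and option handling are assumptions, not the actual configuration):

```yaml
# Hypothetical bump_version job; image, URL, and script details are assumptions.
bump_version:
  stage: versioning
  when: manual
  image: python:3          # assumption: any image with python3 and git
  script:
    - pip install bump2version
    # assumption: push URL with the CI access token; host/path are placeholders
    - git remote set-url --push origin "https://ci:${CI_ACCESS_TOKEN}@gitlab.example.org/melissa.git"
    # --new-version overrides the automatic increment when NEW_VERSION is set
    - bump2version ${NEW_VERSION:+--new-version $NEW_VERSION} $VERSIONING_TYPE
    - git push --push-option=ci.skip origin HEAD --tags
```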