GitLab CI Documentation¶
This document outlines the CI/CD configuration, detailing various components, workflows, and stages in the pipeline.
The GitLab pipeline is designed to be triggered manually through the web interface when CI_PIPELINE_SOURCE is set to "web", allowing developers to initiate the pipeline for specific testing or deployment needs. Additionally, the pipeline leverages Docker-in-Docker (DinD) version 24.0.5 to create an isolated Docker environment, enabling seamless building and management of containers during the CI jobs.
Unit Tests¶
- The launcher was designed as a modular piece of code, independent of the batch scheduler at hand. As a result, the higher-level structural parts (e.g. the I/O master and the state machine) are thoroughly testable.
- Similarly, the server was entirely revisited and redesigned so that the central objects (e.g. BaseServer, Simulation, FaultTolerance, etc.) can be instantiated and tested separately.
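These suites are exercised by the unit_test:dynamic stage described below. As a minimal local sketch (the exact flags are an assumption, not the stage's actual command):

# From the repository root: run the unit tests and collect coverage
pytest --cov=melissa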
Important Notes¶
- Stages utilize Inria's shared runners (ci.inria.fr, large). These runners do not cache Docker images, so every stage pulls its image anew, which makes execution slower than on local runners.
- Multi-rank client or server setups can cause undefined behavior on shared runners.
Update
The final SLURM global stage runs 2 ranks per client and server with a reduced job limit to accommodate multi-rank runs. This covers almost every real-world situation, but there is always room for improvement!
Artifact Configuration¶
Artifacts generated during the pipeline are configured to be retained for one week. The following paths are collected as artifacts:
- Environment setup script: melissa_set_env.sh
- Build directories: $CI_PROJECT_DIR/build/ and $CI_PROJECT_DIR/install/
- Heat PDE executables: examples/heat-pde/executables/build/
- Sensitivity analysis results: $SA_RESULTS_DIR
Note
Each stage artifact is configured to be downloadable from the UI on failure. This might not work for the SLURM stages that run with DinD (Docker-in-Docker).
Predefined Variables¶
These variables avoid repetition and make the stage submissions more generic. The following variables are globally defined but can be overridden in specific stages:
- DOCKER_DRIVER: Set to overlay for Docker storage.
- SA_RESULTS_URL: URL to download sa_sobol_results.npz.
- SA_RESULTS_DIR: Directory for storing SA results.
- BASIC_IMAGE: Base Docker image without CUDA, pulled from the GitLab Registry.
- SLURM_IMAGE: SLURM image for workload management.
- SLURM_CONTENTS: Directory for SLURM configuration files.
- MELISSA_ROOT: Project root directory.
- RESULTS_DIR: Results directory for study outputs, dependent on EXAMPLE_DIR.
- HEATPDE_EXEC_DIR: Directory for heat PDE executable files.
- VIRTUAL_ENV: Virtual environment path for Python dependencies.
- CI_SCRIPTS: Path to CI-related scripts.
- CI_CONFIGS: Path to CI configuration files.
Result reproducibility for SA studies¶
You may be wondering what the SA_* variables refer to. The reference results are contained in a compressed numpy file that is downloaded from the cloud and includes the Moments and Sobol sensitivity analysis results produced by the configuration in examples/heat-pde/heat-pde-sa/config_mpi.json. After making updates to several parts of the code, it is crucial to verify that changes across the melissa module do not affect the produced results.
Once an SA study completes, we compare the new results with the downloaded ones by running:
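The comparison boils down to checking the freshly produced arrays against the reference archive. The sketch below illustrates the idea (the name of the new results file is hypothetical, and the actual CI command may differ):

python3 - <<'EOF'
import numpy as np

ref = np.load("sa_sobol_results.npz")   # downloaded reference results
new = np.load("new_sa_results.npz")     # hypothetical name for the fresh study output

for key in ref.files:
    assert np.allclose(ref[key], new[key]), f"Mismatch in {key}"
print("SA results match the reference.")
EOF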
File organization¶
Each stage consists of two main tasks:
- Setting environment variables such as EXAMPLE_DIR, CONFIG_FILE, etc.
- Submitting a relevant script from ci/scripts that depends on these variables.
Since most Melissa jobs follow a similar command structure, we use Bash scripts to streamline study submissions.
├── ci
│   ├── configs
│   │   ├── lorenz_fail.py
│   │   ├── openmpi_faulttol_dl_client.json
│   │   ├── openmpi_faulttol_dl.json
│   │   ├── openmpi_faulttol_sa.json
│   │   ├── study_g.sh
│   │   ├── study_sg.sh
│   │   ├── vc_fail_dl_server.py
│   │   ├── vc_fail_sa_server.py
│   │   ├── vc_slurm_dl.json
│   │   ├── vc_slurm_global.json
│   │   ├── vc_slurm_openmpi_dl.json
│   │   ├── vc_slurm_openmpi_sa.json
│   │   ├── vc_slurm_sa.json
│   │   └── vc_slurm_semiglobal.json
│   └── scripts
│       ├── build.sh
│       ├── dl_study.sh
│       ├── init_env.sh
│       ├── sa_study.sh
│       ├── slurm_ft_dl_global.sh
│       ├── slurm_ft_dl_pure.sh
│       ├── slurm_ft_dl_semig.sh
│       ├── slurm_ft_sa_pure.sh
│       └── slurm_ompi.sh
├── launcher
│   ├── test_io.py
│   ├── test_message.py
│   └── test_state_machine.py
├── scheduler
│   ├── test_dummy.py
│   ├── test_openmpi.py
│   ├── test_scheduler.py
│   └── test_slurm.py
├── server
│   ├── simple_sa_server.py
│   ├── test_dataset.py
│   ├── test_reservoir.py
│   ├── test_sensitivity_analysis_server.py
│   └── test_server.py
└── utility
    ├── test_functools.py
    ├── test_networking.py
    └── test_timer.py
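Putting the two tasks together, a stage body typically reduces to a few lines of shell. The sketch below is illustrative only (the paths and values are assumptions, not an actual stage definition):

# Hypothetical stage body: point a generic study script at one example
export EXAMPLE_DIR=$MELISSA_ROOT/examples/heat-pde/heat-pde-sa   # assumed path
export CONFIG_FILE=config_mpi.json
bash $CI_SCRIPTS/sa_study.sh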
CI Stages¶
Basic (OpenMPI) stages¶
These stages mostly focus on code checks, Python unit tests, deployment, and OpenMPI studies.
Stage Name | Description |
---|---|
docker_build | Responsible for building and pushing the Docker image. It uses the dind service for container operations. By default, it only runs on the CI_DEFAULT_BRANCH and on merge request events, given specific files have changed. |
build_melissa | Responsible for building the Melissa application. |
code_check:static:flake8_mypy | Performs static code analysis using flake8 and mypy to check for style violations and type annotations, respectively. |
unit_test:dynamic | Runs dynamic tests using pytest and collects code coverage information. |
pages | Deploys the project documentation using mkdocs and includes the coverage report in the generated output. Triggered only for commits to the develop branch. |
integration_test_openmpi:heat_pde | Runs integration tests for the server-side components of the heat-pde examples (SA and DL) using OpenMPI. |
fault_tolerance:heat_pde:server_side | Tests the fault tolerance of the server-side components for the heat-pde examples (SA and DL) using OpenMPI. |
fault_tolerance:py_api_lorenz:client_side | Tests the fault tolerance of the client-side components for the Lorenz system. Depends on both the build_melissa and fault_tolerance:heat_pde:server_side stages. |
SLURM CI Stages¶
The following stages run as part of the SLURM pipeline jobs:
Stage Name | Description |
---|---|
build_docker_slurm | Builds the SLURM Docker cluster image. Similar behavior to the docker_build stage. |
build_melissa_slurm | Builds the Melissa SLURM environment. This stage extends the SLURM image initialization and prepares the environment to run tests. It initializes the SLURM environment and runs the build script, followed by a test script to validate the setup. |
integration_test_slurm_openmpi | Runs integration tests for SLURM with OpenMPI support. This stage executes the integration test scripts in the SLURM environment using OpenMPI, and cancels the job if the test fails. |
fault_tolerance:slurm:server_side | Tests fault tolerance for the SLURM server-side configuration. It runs a series of fault tolerance tests (SA, DL, Semi-Global, and Global) inside the cluster, and cancels the job if any test fails. |
Using SLURM cluster within Gitlab CI¶
The following is an overview of the procedure for utilizing SLURM with Docker in a CI/CD pipeline, detailing the steps involved in setting up the SLURM cluster and running jobs. For more details, check Creating a SLURM docker cluster.
1. Docker-in-Docker (DinD) Service for SLURM Cluster Build¶
In the pipeline, we use the Docker-in-Docker (DinD) service to build and manage the SLURM cluster. This allows us to spin up a containerized environment for SLURM without running the cluster directly on the CI runner. The Docker service helps create the necessary infrastructure for SLURM to function in an isolated environment while providing flexibility and reproducibility across different environments.
2. Mounting $MELISSA_ROOT with Docker Compose¶
To ensure that the SLURM setup has access to the necessary files from the current environment, we mount the $MELISSA_ROOT directory into the container using the docker-compose.yml configuration. This step is crucial because it allows the SLURM containers to interact with the code and configuration files located on the host machine. Since the pipeline does not run directly inside a container, this mount enables seamless access to the project files for all stages.
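Once the cluster is up, the mount can be sanity-checked from the controller container, e.g. (a hedged example, not a required step):

# Verify that the project files are visible inside the cluster
docker exec -t slurmctld bash -c 'ls $MELISSA_ROOT'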
3. Image Pull and Cluster Initialization¶
Each SLURM-related stage in the pipeline will pull the required Docker image to run the SLURM cluster. In the setup:
- The slurmctld container acts as the head node (the main controller of the SLURM cluster).
- The c1, c2, c3, ... containers represent the compute nodes where jobs are executed.
The cluster is launched using Docker Compose, which orchestrates the creation and management of these containers.
4. Register and Update the SLURM Cluster¶
After the SLURM cluster is launched, the next step is to register the cluster and ensure that its configuration is up-to-date:
- Cluster Registration: This is done by executing the .slurm/register_cluster.sh script within the slurmctld container. This step ensures that the SLURM controller is properly registered and initialized.
- Configuration Update: The configuration files slurm.conf and slurmdbd.conf are updated by running the ./update_slurmfiles.sh script. This ensures that the SLURM controller and database configuration files are correctly set up for the environment.
5. Running SLURM Jobs¶
Once the SLURM cluster is set up and registered, SLURM jobs are submitted and executed under the slurmctld container. These jobs are typically Bash scripts that run on the compute nodes managed by the SLURM controller. The jobs can be triggered based on the requirements of the pipeline, such as running simulations or tests.
Important Considerations¶
1. Environment Variable Management¶
When working within the SLURM containers, it's important to manage environment variables carefully. Set or override any necessary environment variables in the .slurm/.env file. These variables will be used within the containers to configure the environment as needed for the SLURM jobs.
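A hedged sketch of such a file (the variable names come from the predefined variables above; the values are placeholders):

# .slurm/.env -- read when the cluster is brought up (values are assumptions)
MELISSA_ROOT=/path/to/MELISSA
CI_SCRIPTS=/path/to/MELISSA/ci/scripts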
2. Absolute Paths for Scripts¶
When executing scripts within the SLURM containers, always ensure the paths to the scripts are specified as absolute paths. This avoids potential issues with relative paths that might be interpreted incorrectly depending on the current working directory inside the container. For example, a preferred way to launch a job inside the cluster is (the script name below is just one of the ci/scripts entries):
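docker exec -t slurmctld bash -c '$CI_SCRIPTS/slurm_ft_sa_pure.sh'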
Warning
When submitting a script (as above), we do not want the variables written in the execution command to be expanded by the current shell, which it does by default. Therefore, it is important to use single quotes so that the variables expand only inside the container.
3. Running the SLURM cluster locally for testing SLURM studies¶
You may want to test any local changes before pushing them. It is simple! Execute the commands below to run a SLURM study:
cd .slurm
# Start the cluster and give the daemons time to come up
docker compose up -d && sleep 10
./register_cluster.sh
./update_slurmfiles.sh slurm.conf slurmdbd.conf
# Build Melissa inside the controller container, then launch a study
docker exec -t slurmctld bash -c '$CI_SCRIPTS/build.sh'
docker exec -t slurmctld bash -c '$CI_SCRIPTS/slurm_ft_dl_global.sh'
Warning
- build.sh will create build/ and install/ folders, as the container mounts MELISSA/. Therefore, always remove them through the Docker container; for example, execute docker exec -t slurmctld bash -c 'rm -rf $MELISSA_ROOT/build' to delete build/.
- Do the same for any STUDY_OUT/ folders generated through container executions.
Once the study is over, run the following to shut the cluster down (assuming the standard Docker Compose teardown):
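# Stop and remove the cluster containers (add -v to also remove the volumes)
docker compose down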
Automatic Versioning (Optional)¶
After all pipeline stages complete successfully, an optional versioning stage can be executed to increment Melissa's version. The bump_version stage is designed to handle this using Python's bump2version module.
The current version is stored in the following files:
$MELISSA_ROOT/VERSION
$MELISSA_ROOT/melissa/version.py
These files are specified in the setup.cfg configuration. When the version is bumped, bump2version automatically updates these files.
To use the versioning stage, the following pre-configuration steps take place:
- Load the Python Docker Container: Ensure the Python Docker container with python3 and git is properly loaded.
- Log In to Git: Authenticate with Git using the current developer's credentials. The account must have Maintainer or Developer privileges.
- Set the git push URL: Configure the git push URL to Melissa's remote repository using the CI_ACCESS_TOKEN environment variable.
  - If the current CI_ACCESS_TOKEN has expired, create a new token:
    - Navigate to Settings → Access Tokens.
    - Generate a new token with the write_repository scope.
  - After generating the token, copy it and update the CI_ACCESS_TOKEN environment variable: go to Settings → CI/CD → Variables and replace the old token with the new one.
The versioning stage is executed manually by the developer. Follow these steps to run the stage:
- Set the VERSIONING_TYPE Environment Variable: Before running the stage, create the VERSIONING_TYPE environment variable. It determines how the version is incremented and can be one of patch, minor, or major. The version is automatically updated based on the current value and the specified VERSIONING_TYPE.
- Optional: Specify a New Version: Developers can manually specify a version by setting the NEW_VERSION environment variable. This overrides the automatic increment. (A sketch of the resulting bump logic is shown below.)
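As a hedged sketch of what the stage effectively does (bump2version's standard CLI; the actual stage script may differ):

# Bump the version according to VERSIONING_TYPE, or force NEW_VERSION if set
if [ -n "$NEW_VERSION" ]; then
    bump2version --new-version "$NEW_VERSION" "$VERSIONING_TYPE"
else
    bump2version "$VERSIONING_TYPE"
fi
# Push the updated version files (and tag, if configured in setup.cfg)
git push origin HEAD --tags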