
External Validation for Online Training

Why is External Validation Required?

In an online training setup, the Deep Learning (DL) server must process a continuous stream of data arriving from multiple concurrent simulations. This creates a critical timing bottleneck:

  • The Problem: Traditional validation is computationally expensive. If the training server pauses to perform validation every few batches, the data reservoir may become unstable or simulations may be forced to idle, which significantly degrades throughput.
  • The Solution: Melissa moves validation to a separate process (typically on a separate GPU). This allows the training server to maintain maximum speed while validation occurs asynchronously in the background.

How it Works

The External Validation system consists of two primary components: the Validation Messenger (integrated into DeepMelissaServer) and the External Validator (a standalone entry point).

Bi-Directional Signaling Protocol

Melissa uses ZeroMQ for structured messaging:

  1. Dynamic Discovery: On startup, the DL Server (Rank 0) binds to available ports and writes a validation_connection.json file to the checkpoint directory. The External Validator reads this file to establish a connection automatically.

  2. Structured Checkpoint Trigger: Users must manually call self.ext_val_send_checkpoint_signal(metadata). The call sends the metadata dictionary to the validator, so it can carry any custom options, such as the checkpoint path to load for the current validation run.

  3. Two-Way Communication: The Validator can send a TERMINATE signal back to the Server (see the example below). If the validator detects metric divergence (e.g., NaN loss or accuracy below a threshold), it can request that training shut down gracefully.

  4. DDP-Aware Synchronization: In DDP mode, Rank 0 receives the termination request and synchronizes it across all server ranks. This ensures a graceful, synchronized exit for the entire cluster.

The Validation Signal

All communication is governed by a typed ValidationSignal dataclass:

  • Type: An Enum (CHECKPOINT, STOP, or TERMINATE).
  • Data: A flexible dictionary object containing checkpoint metadata.
  • Reason: An optional string describing the termination cause or stop status.
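To make the three fields concrete, here is a rough sketch of how such a signal could be modeled and serialized for transport. It is an illustration under assumptions, not Melissa's definition: the names SignalType and ValidationSignalSketch, the JSON encoding, and the to_json/from_json helpers are all hypothetical.

```python
import json
from dataclasses import dataclass, field, asdict
from enum import Enum
from typing import Optional


class SignalType(Enum):
    # The three signal kinds named in the documentation.
    CHECKPOINT = "CHECKPOINT"
    STOP = "STOP"
    TERMINATE = "TERMINATE"


@dataclass
class ValidationSignalSketch:
    type: SignalType
    data: dict = field(default_factory=dict)   # checkpoint metadata
    reason: Optional[str] = None               # termination cause or stop status

    def to_json(self) -> str:
        # Assumed wire format: a JSON object with the enum stored by value.
        payload = asdict(self)
        payload["type"] = self.type.value
        return json.dumps(payload)

    @classmethod
    def from_json(cls, raw: str) -> "ValidationSignalSketch":
        payload = json.loads(raw)
        return cls(
            SignalType(payload["type"]),
            payload.get("data", {}),
            payload.get("reason"),
        )
```

A typed envelope like this lets both sides branch on the signal type instead of inspecting free-form dictionaries.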

Usage Pattern

Users must set the following option so that the server instantiates a messenger that can communicate with the external validation process.

{
    "dl_config": {
        "external_validation": true
    }
}

To implement a custom validator script, users can inherit from BaseExternalValidator and implement their validation logic as required.

import numpy as np
import torch
from melissa.utility.external_validator import (
    BaseExternalValidator, get_default_kwargs
)
from custom_utils import compute_metrics, check_early_stop

class MyValidator(BaseExternalValidator):

    def __init__(self, arg1, arg2, **base_kwargs):
        super().__init__(**base_kwargs)
        self.arg1 = arg1
        self.arg2 = arg2

    def validation_entrypoint(self, metadata: dict):
        """Main entrypoint executed after receiving the checkpoint
        signal from the DL server."""

        # user-defined aspects
        path = metadata.get("latest_model_path")
        model = torch.load(path)
        val_loss = compute_metrics(model)

        # send an explicit termination request to stop the training server
        if not np.isfinite(val_loss):
            self.request_server_termination(
                reason="Loss contains NaN/Inf"
            )
            # skip the early-stopping check so only one request is sent
            return

        if check_early_stop(val_loss):
            self.request_server_termination(
                reason="Early stopping as validation loss did not improve"
            )


if __name__ == "__main__":
    # get_default_kwargs() returns a dict compatible with the base class
    base_ext_val_kwargs = get_default_kwargs()
    validator = MyValidator(1, 2, **base_ext_val_kwargs)
    validator.run()
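The example imports check_early_stop from a user-provided custom_utils module and calls it with a single loss value, so the helper has to keep state between validation runs. One possible shape for it is a patience-based stopper, sketched below. The EarlyStopper class and its patience and min_delta parameters are hypothetical and are not part of Melissa.

```python
import math


class EarlyStopper:
    """Return True once the loss has failed to improve `patience` times in a row."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = math.inf
        self.bad_rounds = 0

    def __call__(self, val_loss):
        if val_loss < self.best - self.min_delta:
            # Improvement: record the new best and reset the counter.
            self.best = val_loss
            self.bad_rounds = 0
        else:
            self.bad_rounds += 1
        return self.bad_rounds >= self.patience


# Module-level instance so callers can write check_early_stop(val_loss).
check_early_stop = EarlyStopper(patience=3)
```

Because the validator runs as a long-lived process, the module-level instance persists across checkpoint signals, which is what lets the counter accumulate.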

To execute the external validator, users must run it as a separate job or standalone process in parallel with the main Melissa study. The validator relies on the checkpoints/validation_connection.json file created by the server to perform the handshake.

Running External Validation alongside Melissa Study

In an HPC environment, the most efficient way to run this is to give the validator its own allocation, separate from the one used by the study. This prevents the validation logic from competing for the training server's CPU/GPU resources.

Below is an example Slurm sbatch script that uses a heterogeneous job to submit a slurm-global study:

#!/bin/bash
#SBATCH --job-name=melissa-study-external-val
#SBATCH --output=melissa-%j.out

# GROUP 0: CLIENT ALLOCATION
#SBATCH --nodes=1 --ntasks-per-node=40

#SBATCH hetjob
# GROUP 1: SERVER ALLOCATION
#SBATCH --nodes=1 --ntasks=2 --gres=gpu:2

#SBATCH hetjob
# GROUP 2: EXTERNAL VALIDATOR ALLOCATION
#SBATCH --nodes=1 --ntasks=1 --gres=gpu:1

# 1. Initialize the environment (melissa and its dependencies)

# 2. Launch the validator on its own allocation (Group 2)
# It will wait for the server to write 'validation_connection.json'
srun --het-group=2 --exclusive \
    python3 my_validator.py --config config_slurm_global.json --poll-interval 30 \
    &                             # < ------ MIND THE & TO ENSURE NON-BLOCKING EXECUTION

# 3. Launch the Melissa study; the launcher_config uses GROUP 0 and GROUP 1
exec melissa-launcher --config config_slurm_global.json