Rate Limiter for Online Training

In online training, HPC instabilities (queue noise, network jitter, node variability, transient slowdowns) can make the production/consumption balance fluctuate during the same run.

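When the ratio between how often buffered samples are viewed by training and how often they are renewed by incoming simulation data (the view/renewal ratio) drifts, the age distribution of the samples in the buffer changes over time, which can alter training dynamics in unwanted ways.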

The DL rate limiter helps keep this ratio more controlled over time. In practice, this is useful to:

  • Control the flow of incoming samples,
  • Stabilize training by reducing abrupt shifts in data arrival speed,
  • Improve comparability across different experiments.

How it Works

When the rate-limiting mechanism is enabled, Melissa computes the target number of samples to receive for each processed batch ($N_{\text{target}}$), so that the deep learning update frequency stays synchronized with the simulation speed:

$$N_{\text{target}} = \frac{T}{R}$$

Where:

  • $T$ (Total Time Steps): the total number of time steps, inferred from the study_options.nb_time_steps configuration parameter.

  • $R$ (Rate Limit Speed): the target throughput of the system, expressed in batches per simulation. It acts as a throttle so that the DL server does not outpace the data generation rate.
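As a worked illustration of the definitions above, the sketch below assumes the target is simply the number of time steps per simulation divided by the rate limit speed (samples per simulation over batches per simulation). The function name `target_samples_per_batch` is hypothetical and is not part of Melissa's API.

```python
def target_samples_per_batch(nb_time_steps: int, rate_limit_speed: float) -> float:
    """Assumed relation: (samples per simulation) / (batches per simulation)."""
    if rate_limit_speed <= 0:
        raise ValueError("rate_limit_speed must be strictly positive")
    return nb_time_steps / rate_limit_speed


# 100 time steps per simulation at 2 batches per simulation:
# each batch should be matched by 50 newly received samples.
print(target_samples_per_batch(100, 2.0))  # 50.0
```

Under this assumption, a lower rate limit speed means more fresh samples must arrive between two consecutive batches, so training proceeds more slowly relative to data production.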

Melissa applies a simple synchronization between receiver and trainer threads:

  • If too many samples arrived since the last batch, reception waits,
  • If not enough samples arrived for the target pace, training waits.
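The two waiting rules above can be sketched with a single condition variable shared by both threads. This is a minimal illustration under assumed semantics, not Melissa's implementation: the class `PaceLimiter`, its methods, and the "one batch of headroom" budget are all hypothetical.

```python
import threading


class PaceLimiter:
    """Hypothetical sketch of receiver/trainer pacing (not Melissa's code)."""

    def __init__(self, samples_per_batch: float):
        self.samples_per_batch = samples_per_batch
        self.received = 0   # samples received since counting started
        self.batches = 0    # batches processed since counting started
        self._cond = threading.Condition()

    def _budget(self) -> float:
        # Samples that should have arrived by the time the next batch runs.
        return (self.batches + 1) * self.samples_per_batch

    def on_sample(self) -> None:
        """Receiver side: wait while reception is ahead of the target pace."""
        with self._cond:
            self._cond.wait_for(lambda: self.received < self._budget())
            self.received += 1
            self._cond.notify_all()

    def before_batch(self) -> None:
        """Trainer side: wait until enough samples arrived for the next batch."""
        with self._cond:
            self._cond.wait_for(lambda: self.received >= self._budget())
            self.batches += 1
            self._cond.notify_all()
```

In this sketch the receiver thread would call `on_sample()` for each incoming sample and the trainer thread would call `before_batch()` before each update. At any moment exactly one of the two conditions holds, so one side can always make progress.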

Note

To avoid slowing down startup unnecessarily, counting only starts after the local buffer reaches its watermark.

How to Enable

Set rate_limit_speed_batch_per_sim (a strictly positive float) in the dl_config section of the configuration:

{
    "dl_config": {
        "rate_limit_speed_batch_per_sim": 1.0
    }
}
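For illustration, the sketch below reads this option from a parsed configuration and returns `None` when the limiter should be ignored, mirroring the documented fallback for missing or non-positive values. The helper `read_rate_limit` is hypothetical, and treating non-numeric or non-finite values as invalid is an assumption rather than documented Melissa behavior.

```python
import json
import math
from typing import Optional


def read_rate_limit(config: dict) -> Optional[float]:
    """Return the rate limit speed, or None if the limiter should be ignored."""
    value = config.get("dl_config", {}).get("rate_limit_speed_batch_per_sim")
    # Assumption: booleans and non-numeric values count as invalid.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return None
    # Missing, non-finite, or non-positive values disable the limiter.
    if not math.isfinite(value) or value <= 0:
        return None
    return float(value)


config = json.loads('{"dl_config": {"rate_limit_speed_batch_per_sim": 1.0}}')
print(read_rate_limit(config))  # 1.0
```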

Practical Guidance

  • Recommended calibration workflow:

    1. Run once without rate limiting.
    2. After some time, count how many batches were processed for a given number of simulations.
    3. Set rate_limit_speed_batch_per_sim to the observed ratio of batches to simulations.
    4. Keep this value for all the experiments you want to compare, even if some of them could run faster or slower, to ensure comparability.
  • If the simulations are intrinsically slow, increase the number of clients when possible via launcher_config.job_limit, so that production stays sufficiently steady compared to consumption.

  • If the configuration is missing or invalid (non-positive value), Melissa ignores the limiter and continues with the default behavior.
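The calibration workflow above amounts to a single division. The sketch below computes the ratio from counts observed in an unthrottled run and prints the matching configuration fragment. The function name `calibrate_rate_limit` is hypothetical and the counts are made-up illustrative values, not measurements.

```python
import json


def calibrate_rate_limit(batches_processed: int, simulations_completed: int) -> float:
    """Observed batches per simulation from a run without rate limiting."""
    if batches_processed <= 0 or simulations_completed <= 0:
        raise ValueError("both counts must be strictly positive")
    return batches_processed / simulations_completed


# Illustrative counts only: 1200 batches observed while 400 simulations completed.
speed = calibrate_rate_limit(batches_processed=1200, simulations_completed=400)

# Configuration fragment to reuse across all experiments being compared.
print(json.dumps({"dl_config": {"rate_limit_speed_batch_per_sim": speed}}, indent=4))
```

Reusing the same value across runs is what keeps the experiments comparable, even when a given run could have gone faster without the limiter.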