
Resampling Workflow

The ExperimentalDeepMelissaActiveSamplingServer is designed to manage active sampling in a distributed environment using MPI (Message Passing Interface). The server's flow is centered around the divergence of MPI ranks into two distinct roles: training ranks and a dedicated breeding rank.

This divergence allows the server to efficiently handle resampling tasks while continuing training simulations in parallel.

Initialization Phase

The server is initialized with active sampling parameters, thresholds, and other settings. During initialization, the server validates the configuration and sets up key attributes such as ac_nn_threshold, ac_sim_threshold, and ac_min_finished_simulations_threshold. These attributes govern the conditions under which resampling is triggered and ensure that the server operates within the defined parameters.

To facilitate its operation, the server splits the MPI communicator (MPI.COMM_WORLD) into two distinct groups: training ranks and a breeding rank. The training ranks are responsible for handling the training of simulations, while the breeding rank is dedicated to managing resampling tasks. This separation is achieved through the __split_comm_for_training method, which ensures that the breeding rank is isolated from the training ranks, allowing it to focus exclusively on resampling operations.
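Conceptually, the split assigns each rank a "color" and calls the communicator's Split operation so that ranks with the same color end up in the same sub-communicator. The sketch below is illustrative only: it assumes the breeding rank is the last global rank, whereas the actual __split_comm_for_training method determines the breeding rank dynamically (see the note below).

```python
# Illustrative sketch of the color logic behind a communicator split.
# Assumption (not from the source): the breeding rank is the last rank.
TRAINING, BREEDING = 0, 1

def comm_split_color(rank: int, world_size: int) -> int:
    """Return the split "color" a rank would pass to MPI Comm.Split:
    TRAINING ranks share one communicator, the BREEDING rank is isolated."""
    return BREEDING if rank == world_size - 1 else TRAINING

# With 13 server ranks, ranks 0-11 form the training communicator
# and rank 12 is isolated as the breeding rank.
colors = [comm_split_color(r, 13) for r in range(13)]
```

With mpi4py, each rank would then call `MPI.COMM_WORLD.Split(color, key=rank)` to obtain its group-local communicator.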

Note

The breeding rank is dynamically determined at runtime based on the distribution of MPI ranks across multiple nodes. Typically, it is assigned to an additional local MPI rank in a distributed setup. For instance, in a cluster with 4 GPUs per node, if the study requires 3 nodes (12 GPUs), the server should be launched with 13 MPI ranks. This extra rank, allocated on one of the nodes, is designated as the breeding rank. The breeding rank plays a critical role in managing resampling tasks, including gathering statistics from training ranks and initiating resampling processes.
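The rank-count arithmetic from the note can be written down directly: one training rank per GPU, plus one extra rank for breeding.

```python
def required_server_ranks(nodes: int, gpus_per_node: int) -> int:
    """One training rank per GPU across all nodes, plus one
    dedicated breeding rank."""
    return nodes * gpus_per_node + 1

# The example from the note: 3 nodes with 4 GPUs each -> 13 ranks.
ranks = required_server_ranks(3, 4)
```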

Training and Resampling Workflow

The divergence of MPI ranks is a key feature of the server, enabling parallel execution of training and resampling. The workflow is as follows:

A. Training Ranks

Training involves the training ranks recording batch-wise statistics. At the end of each batch, during the _on_batch_end phase, the server evaluates whether resampling should be triggered. This ensures that the system dynamically adapts based on the progress of the training process.

Periodic resampling checks are performed by the training ranks to determine if resampling conditions are met. These checks are based on predefined thresholds, such as ac_nn_threshold and ac_min_finished_simulations_threshold. When the conditions are satisfied, the training ranks synchronize with the breeding rank to prepare for the resampling process.
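The periodic check can be pictured as a simple predicate over the recorded statistics and the configured thresholds. The exact metrics and comparison semantics below are assumptions for illustration; only the threshold names come from the source.

```python
def should_trigger_resampling(nn_metric: float,
                              finished_simulations: int,
                              ac_nn_threshold: float,
                              ac_min_finished_simulations_threshold: int) -> bool:
    """Hypothetical resampling condition checked at _on_batch_end:
    both the network-based metric and the number of finished
    simulations must reach their configured thresholds."""
    return (nn_metric >= ac_nn_threshold
            and finished_simulations >= ac_min_finished_simulations_threshold)

# Not enough finished simulations yet -> no resampling.
early = should_trigger_resampling(0.9, 3, 0.5, 5)
# Both conditions met -> synchronize with the breeding rank.
ready = should_trigger_resampling(0.9, 10, 0.5, 5)
```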

During synchronization with the breeding rank, the training ranks collect and send relevant statistics, such as delta losses and sliding windows, to the breeding rank. Once resampling is completed, the training ranks receive a success status from the breeding rank, confirming the outcome of the resampling. Importantly, if resampling is still ongoing, the training ranks are not blocked and continue executing their primary tasks without interruption.

B. Breeding Rank

The breeding rank plays a pivotal role in the resampling workflow by gathering statistics, triggering resampling, and updating parameters. It begins by gathering statistics from all training ranks using MPI collective operations, such as MPI.Gather. These statistics are unified and used to update the breeding rank's metadata before computing the fitnesses required for resampling.
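After MPI.Gather, the breeding rank holds one statistics object per training rank and must merge them into a single view. The sketch below shows one plausible unification step; the field names (delta_losses, sliding_windows, keyed by simulation ID) are illustrative assumptions, not the server's actual schema.

```python
def unify_rank_statistics(gathered):
    """Merge per-rank statistics (as delivered by MPI.Gather on the
    breeding rank) into one unified view. Field names are illustrative."""
    unified = {"delta_losses": {}, "sliding_windows": {}}
    for stats in gathered:
        unified["delta_losses"].update(stats.get("delta_losses", {}))
        unified["sliding_windows"].update(stats.get("sliding_windows", {}))
    return unified

# Two training ranks, each reporting on its own simulations.
per_rank = [
    {"delta_losses": {0: 0.12}, "sliding_windows": {0: [0.5, 0.4]}},
    {"delta_losses": {1: 0.08}, "sliding_windows": {1: [0.3, 0.2]}},
]
merged = unify_rank_statistics(per_rank)
```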

Once the necessary statistics are processed, the breeding rank triggers the resampling process by invoking the active_sampling.trigger_sampling method. During this step, it calculates the number of simulations to breed, referred to as max_breeding_count, ensuring that the resampling process aligns with the defined thresholds and parameters.
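One plausible way max_breeding_count could relate to the computed fitnesses and ac_sim_threshold is sketched below. This is a guess at the shape of the computation, not the actual rule used by active_sampling.trigger_sampling.

```python
def compute_max_breeding_count(fitnesses, ac_sim_threshold):
    """Hypothetical rule: only simulations whose fitness exceeds
    ac_sim_threshold are candidates for breeding. The real
    trigger_sampling computation may differ."""
    return sum(1 for fitness in fitnesses if fitness > ac_sim_threshold)

# Two of three simulations exceed the threshold.
count = compute_max_breeding_count([0.1, 0.6, 0.9], 0.5)
```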

After completing the resampling process, the breeding rank broadcasts the status of the current resampling to all training ranks. This ensures that the training ranks start using new parameters immediately.

The breeding rank also manages the resampling process to prevent overlapping resampling events by maintaining an ongoing_resample_event flag. Additionally, it handles exceptions that may occur during resampling and ensures proper cleanup in case of failures, maintaining the server's stability and consistency throughout the workflow.
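The flag-plus-cleanup pattern described above can be sketched as follows. The class and method names are invented for illustration; only the ongoing_resample_event flag name comes from the source.

```python
class BreedingGuard:
    """Sketch of preventing overlapping resampling events with an
    ongoing_resample_event flag, with cleanup guaranteed on failure."""

    def __init__(self):
        self.ongoing_resample_event = False

    def run_resampling(self, trigger_fn):
        if self.ongoing_resample_event:
            return False  # a resampling event is already in flight
        self.ongoing_resample_event = True
        try:
            trigger_fn()  # e.g. active_sampling.trigger_sampling
            return True
        finally:
            # Runs even if trigger_fn raises, so a failed resampling
            # never leaves the flag stuck and the server stays consistent.
            self.ongoing_resample_event = False

guard = BreedingGuard()
ok = guard.run_resampling(lambda: None)  # flag is reset afterwards
```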

Reactive Submission Strategy

To align with the new resampling approach, the job submission process has been updated (across all Melissa servers) to make it more reactive to the most recent resampling phase. Previously, Melissa would send explicit requests to launch all clients at the start of the study. On the launcher side, these requests were buffered to ensure that the number of running jobs did not exceed the allocated resources, defined as job_limit - 1.

In the updated strategy, submission requests are handled lazily, with a 1-by-1 submission approach. Initially, the server submits job_limit - 1 client processes. As soon as a client finishes, the server immediately requests the launcher to start the next client. This ensures that the number of running client processes remains within the job_limit - 1 constraint while allowing the server to dynamically adapt to the resampling process. This approach also provides the flexibility to determine the starting point or row ID where newly resampled parameters will be inserted into the parameter matrix.

During the study, the server continues to submit new requests as clients finish, ensuring a steady flow of submissions while training and resampling are ongoing. After all ranks synchronize following a resampling phase, the server temporarily halts the 1-by-1 submission process. It then determines the starting point for the newly resampled parameters and inserts them in the parameter matrix before resuming submissions from that point. This ensures that the latest resampled parameters are used for subsequent client submissions, making the process highly reactive and efficient.
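The bookkeeping behind the 1-by-1 strategy can be sketched as below: at most job_limit - 1 clients run at once, each finished client triggers exactly one new submission, and freshly bred parameters jump to the front of the queue after a resampling phase. The class and its methods are illustrative, not the real launcher protocol.

```python
from collections import deque

class ReactiveSubmitter:
    """Sketch of the reactive 1-by-1 submission strategy.
    Names are illustrative, not the actual Melissa launcher API."""

    def __init__(self, job_limit: int, pending_ids):
        self.max_running = job_limit - 1
        self.pending = deque(pending_ids)
        self.running = set()
        self.submitted = []  # submission order, for inspection

    def start(self):
        """Initial burst: fill up to job_limit - 1 running clients."""
        while self.pending and len(self.running) < self.max_running:
            self._submit(self.pending.popleft())

    def on_client_finished(self, sim_id):
        """Each finished client triggers exactly one new submission."""
        self.running.discard(sim_id)
        if self.pending:
            self._submit(self.pending.popleft())

    def insert_resampled(self, new_ids):
        """After ranks synchronize post-resampling, bred parameters
        take priority for subsequent submissions."""
        self.pending.extendleft(reversed(new_ids))

    def _submit(self, sim_id):
        self.running.add(sim_id)
        self.submitted.append(sim_id)

# job_limit = 4 -> at most 3 clients run concurrently.
sub = ReactiveSubmitter(4, list(range(5)))
sub.start()                  # submits simulations 0, 1, 2
sub.on_client_finished(0)    # frees a slot, submits simulation 3
sub.insert_resampled([10, 11])
sub.on_client_finished(1)    # next submission uses bred parameters
```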

Note

Additionally, client scripts now follow a standardized naming convention:

client.<resampling_step>(B).<simulation_id>.sh.

Here, <resampling_step> indicates the resampling phase, and B specifies whether the client is using a proposed (bred) set of parameters. This naming scheme improves clarity and traceability in the submission process.
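A small helper makes the convention concrete. The exact rendering of the optional B marker (appended directly to the step number) is an assumption read off the pattern above.

```python
def client_script_name(resampling_step: int, simulation_id: int,
                       bred: bool = False) -> str:
    """Build a script name following the documented pattern
    client.<resampling_step>(B).<simulation_id>.sh.
    Assumption: the optional "B" is appended to the step number."""
    step = f"{resampling_step}B" if bred else str(resampling_step)
    return f"client.{step}.{simulation_id}.sh"

regular = client_script_name(2, 17)            # "client.2.17.sh"
bred = client_script_name(2, 17, bred=True)    # "client.2B.17.sh"
```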