Server Class¶
melissa.server.base_server.BaseServer¶
Bases: ABC
BaseServer class that handles the following tasks:
- Manages connections with the launcher and clients.
- Generates client scripts for simulations.
- Encodes and decodes messages between the server and clients.
- Provides basic checkpointing functionality to save and restore states.
Parameters¶
- config_dict (Dict[str, Any]): A dictionary containing configuration settings for initializing the server.
- checkpoint_file (str, optional): The filename for the checkpoint file (default is "checkpoint.pkl"). This file is used for saving and restoring the server's state.
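The save-and-restore pattern behind checkpoint_file might look roughly like the sketch below. It assumes the state is a plain picklable dictionary; the helper names save_checkpoint and load_checkpoint and the example state keys are illustrative and are not part of BaseServer.

```python
import os
import pickle
import tempfile
from typing import Any, Dict, Optional


def save_checkpoint(state: Dict[str, Any], checkpoint_file: str = "checkpoint.pkl") -> None:
    """Hypothetical helper: write the state to a temp file, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(checkpoint_file))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
        # os.replace is atomic on the same filesystem, so a crash mid-write
        # never leaves a truncated checkpoint behind.
        os.replace(tmp_path, checkpoint_file)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(checkpoint_file: str = "checkpoint.pkl") -> Optional[Dict[str, Any]]:
    """Hypothetical helper: return the saved state, or None if no checkpoint exists."""
    if not os.path.exists(checkpoint_file):
        return None
    with open(checkpoint_file, "rb") as f:
        return pickle.load(f)
```

Writing to a temporary file first is a design choice of this sketch, not something the documentation above states about the server.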
Attributes¶
- comm (MPI.Intracomm): The MPI communicator for inter-process communication.
- rank (int): The rank of the current process in the MPI communicator.
- comm_size (int): The total number of server processes in the MPI communicator.
- client_comm_size (int): The total number of client processes.
- server_processes (int): Synonym for comm_size.
- connection_port (int): The server port to establish request-response connection with the clients.
- data_puller_port (int): The server port to establish data pulling with the clients.
- _offline_mode (bool): Internal flag indicating offline mode where no sending operation takes place. Useful when running multiple clients to produce datasets. Call self.make_study_offline to enable.
- _learning (int): Internal flag indicating the learning state (initially 0).
- __t0 (float): The timestamp marking the initialization time of the object.
- _job_limit (int): Maximum number of jobs the launcher can manage concurrently.
- __is_direct_scheduler (bool): Flag indicating whether the study is using a direct scheduler.
- _restart (int): Flag indicating if the system is in a restart state; initialized from the MELISSA_RESTART environment variable.
- _consistency_lock (threading.RLock): Reentrant lock to ensure thread-safe operations on shared resources.
- _is_receiving (bool): Flag indicating whether data reception is ongoing.
- _is_online (bool): Flag indicating if the system is in an online operational mode.
- _sobol_op (bool): Flag indicating whether Sobol operations are being performed.
- _total_bytes_recv (int): Tracks the total number of bytes received over the network.
- _active_sim_ids (set): Set of active simulation ids currently being managed.
- _groups (Dict[int, Group]): Dictionary mapping group ids to Group objects.
- _parameter_sampler (Optional[BaseExperiment]): Sampler for generating parameter values for simulations.
- __parameter_generator (Any): Internal generator object for producing parameters.
- verbose_level (int): Determines the verbosity level for logging and debugging output.
- config_dict (Dict[str, Any]): Configuration dictionary provided during initialization.
- checkpoint_file (str): File name used for storing checkpoint data.
- crashes_before_redraw (int): Number of simulation crashes allowed before redrawing parameters.
- max_delay (Union[int, float]): Maximum allowed delay for simulations, in seconds.
- rm_script (bool): Indicates whether client scripts should be removed after execution.
- group_size (int): Number of simulations grouped together for batch processing.
- zmq_hwm (int): High-water mark for ZeroMQ communication.
- fields (List[str]): List of field names used in the study.
- nb_parameters (int): Number of parameters in the parameter sweep study.
- nb_time_steps (int): Number of time steps in each simulation.
- nb_clients (int): Total number of clients participating in the parameter sweep study.
- nb_groups (int): Total number of groups, derived from the number of clients and group size.
- nb_submitted_groups (int): Tracks the number of groups submitted so far.
- finished_groups (set): Tracks the finished set of groups.
- mtt_simulation_completion (float): Iteratively keeps track of mean of simulation durations.
- no_fault_tolerance (bool): Indicates whether fault tolerance is disabled, based on the MELISSA_FAULT_TOLERANCE environment variable.
- __ft (FaultTolerance): Fault tolerance object managing simulation crashes and retries.
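One common way to keep a mean "iteratively", as described for mtt_simulation_completion, is the incremental update below. This is a sketch under the assumption that a plain arithmetic mean is tracked; the class name RunningMean is hypothetical and the server's actual update rule may differ.

```python
class RunningMean:
    """Hypothetical helper: arithmetic mean updated one sample at a time."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0

    def update(self, duration: float) -> float:
        # mean_n = mean_{n-1} + (x_n - mean_{n-1}) / n
        # No list of past durations has to be stored.
        self.count += 1
        self.mean += (duration - self.mean) / self.count
        return self.mean


tracker = RunningMean()
for seconds in (10.0, 20.0, 30.0):
    tracker.update(seconds)
```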
time_steps_known
property
¶
Whether the time steps are known prior to the study.
is_direct_scheduler
property
¶
Whether the study is using a direct scheduler.
learning
property
¶
Whether deep learning is activated. Required when establishing a connection with clients.
consistency_lock
property
¶
The reentrant lock that guards shared resources. Useful for active sampling.
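As an illustration of why a reentrant lock (threading.RLock) is convenient here, the sketch below guards a shared set from several threads and re-acquires the lock from a nested call. The class SharedState and its methods are hypothetical and only mirror the idea of the consistency lock.

```python
import threading
from typing import Iterable, Set


class SharedState:
    """Hypothetical stand-in for server state shared between threads."""

    def __init__(self) -> None:
        self.consistency_lock = threading.RLock()
        self.finished: Set[int] = set()

    def mark_finished(self, group_id: int) -> None:
        with self.consistency_lock:
            self.finished.add(group_id)

    def mark_many_finished(self, group_ids: Iterable[int]) -> None:
        # The same thread acquires the lock again inside mark_finished.
        # With a plain threading.Lock this nested acquire would deadlock.
        with self.consistency_lock:
            for group_id in group_ids:
                self.mark_finished(group_id)


state = SharedState()
workers = [
    threading.Thread(target=state.mark_many_finished, args=(range(i * 10, i * 10 + 10),))
    for i in range(4)
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
```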
__loop_pings()
¶
Maintains communication with the launcher to ensure it does not assume the server has become unresponsive.
__initialize_ports(connection_port=2003, data_puller_port=5000)
¶
Assigns port numbers for connection and data pulling as class attributes. If the specified ports are already in use, likely due to multiple servers running on the same node, the function attempts to find available ports by incrementing the base port values and rechecking their availability.
Note: When multiple independent melissa-server jobs are running simultaneously
on the same node, there is a chance that a port may incorrectly appear as available,
leading to potential conflicts.
Parameters¶
- connection_port (int, optional): The port number used for establishing the main connection (default is 2003).
- data_puller_port (int, optional): The port number used for pulling data (default is 5000).
Raises¶
FatalError: If no available ports are found after the given number of attempts.
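The increment-and-recheck strategy described above can be sketched with the standard socket module. The function names, the max_attempts parameter, and the use of RuntimeError in place of FatalError are assumptions made for this example; they do not reproduce the server's private method.

```python
import socket


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Hypothetical probe: a port counts as free if a TCP socket can bind to it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
        return True


def find_free_port(base_port: int, max_attempts: int = 10) -> int:
    """Try base_port, base_port + 1, ... and return the first port that binds."""
    for offset in range(max_attempts):
        port = base_port + offset
        if is_port_free(port):
            return port
    # Stand-in for the FatalError mentioned in the documentation.
    raise RuntimeError(
        f"no free port found in [{base_port}, {base_port + max_attempts})"
    )
```

The probe socket is closed before the caller binds the real socket, so another process can grab the port in between. That window is one way the conflict described in the note above can arise.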
__connect_to_launcher()
¶
Establishes a connection with the launcher and sends metadata about the study.
__setup_sockets()
¶
Sets up ZeroMQ (ZMQ) sockets over a given TCP connection port for communication.
__setup_poller()
¶
This method sets up the polling mechanism by registering three important sockets:
- Data Socket: Handles data communication.
- Timer Socket: Manages timing events.
- Launcher Socket: Facilitates communication with the launcher.
__start_debugger()
¶
Launches the Visual Studio Code (VSCode) debugger for debugging purposes.
configure_logger()
¶
Configures server loggers for each MPI rank.
initialize_connections()
¶
Initializes socket connections for communication.
set_parameter_sampler(sampler_t, **kwargs)
¶
Sets the defined parameter sampler type. This dictates how parameters are sampled for experiments. This sampler type can either be pre-defined or customized by inheriting a pre-defined sampling class.
Parameters¶
- sampler_t (Union[ParameterSamplerType, Type[ParameterSamplerClass]]):
  - ParameterSamplerType: Enum specifying pre-defined samplers.
  - Type[ParameterSamplerClass]: A class type to instantiate.
- kwargs (Dict[str, Any]): Dictionary of keyword arguments. Useful for passing custom parameters as well as strict parameters such as l_bounds, u_bounds, apply_pick_freeze, second_order, seed=0, etc.
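The "enum member or class type" dispatch described above can be sketched as follows. Every name in this block (SamplerType, UniformSampler, RoundedSampler, make_sampler) is hypothetical and stands in for the real Melissa sampler types; only the keyword names l_bounds, u_bounds, and seed come from the documentation.

```python
import random
from enum import Enum
from typing import Any, List, Sequence, Type, Union


class UniformSampler:
    """Hypothetical pre-defined sampler drawing uniformly within the bounds."""

    def __init__(self, l_bounds: Sequence[float], u_bounds: Sequence[float],
                 seed: int = 0, **kwargs: Any) -> None:
        self.l_bounds = list(l_bounds)
        self.u_bounds = list(u_bounds)
        self.rng = random.Random(seed)

    def draw(self) -> List[float]:
        return [self.rng.uniform(lo, hi) for lo, hi in zip(self.l_bounds, self.u_bounds)]


class RoundedSampler(UniformSampler):
    """Hypothetical custom sampler: inherits the pre-defined one, rounds each draw."""

    def draw(self) -> List[float]:
        return [round(x, 2) for x in super().draw()]


class SamplerType(Enum):
    UNIFORM = "uniform"


_REGISTRY = {SamplerType.UNIFORM: UniformSampler}


def make_sampler(sampler_t: Union[SamplerType, Type[UniformSampler]],
                 **kwargs: Any) -> UniformSampler:
    # An enum member is looked up in the registry; a class is used directly.
    cls = _REGISTRY[sampler_t] if isinstance(sampler_t, SamplerType) else sampler_t
    return cls(**kwargs)


predefined = make_sampler(SamplerType.UNIFORM, l_bounds=[0.0, 10.0], u_bounds=[1.0, 20.0], seed=0)
custom = make_sampler(RoundedSampler, l_bounds=[0.0], u_bounds=[1.0], seed=0)
```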
__generate_client_script(sim_id, parameters, script_path)
¶
Generates a single client script for a given simulation id and parameters.
Parameters¶
- sim_id (int): The simulation id associated with the client script.
- parameters (list): The list of parameters.
- script_path (str): The absolute path of the client script to create.
poll_sockets(timeout=10)
¶
Performs polling over the registered socket descriptors to monitor various events, including timer, launcher messages, new client connections, and data readiness.
Parameters¶
- timeout (int, optional): The maximum time (in seconds) to wait for a socket event before returning. Default is 10 seconds.
Returns¶
Optional[Union[ServerStatus, SimulationData, PartialSimulationData]]:
- ServerStatus if the event is related to server status.
- SimulationData if new simulation data is received.
- PartialSimulationData if partial data from a simulation is received.
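The register-then-poll-with-timeout pattern can be illustrated with the standard library alone. The sketch below swaps ZeroMQ for Python's selectors module and local socket pairs, so it shows the control flow only; poll_once, the "launcher" and "data" labels, and the payloads are assumptions for the example.

```python
import selectors
import socket
from typing import Optional, Tuple

sel = selectors.DefaultSelector()

# Socket pairs stand in for the launcher and data sockets.
launcher_rx, launcher_tx = socket.socketpair()
data_rx, data_tx = socket.socketpair()

# Each registered socket carries a label that identifies the event source.
sel.register(launcher_rx, selectors.EVENT_READ, data="launcher")
sel.register(data_rx, selectors.EVENT_READ, data="data")


def poll_once(timeout: float = 10) -> Optional[Tuple[str, bytes]]:
    """Return (source, payload) for the first ready socket, or None on timeout."""
    for key, _events in sel.select(timeout=timeout):
        return key.data, key.fileobj.recv(4096)
    return None
```

Returning None on timeout mirrors the Optional return type documented above; the caller can loop on poll_once and branch on the source label.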
__forceful_group_termination(group_id)
¶
Forcefully terminates all clients in a group.
__handle_timerfd()
¶
Handles timer messages.
__handle_failed_group(group_id)
¶
Handles a failed group by using fault tolerance to decide on resubmission.
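A resubmission decision driven by crashes_before_redraw could look like the sketch below. The class RetryPolicy, its return values, and the rule "redraw once the failure count reaches the threshold, then reset the counter" are assumptions for illustration; the FaultTolerance object may apply different logic.

```python
from collections import defaultdict
from typing import Dict


class RetryPolicy:
    """Hypothetical policy: resubmit a failed group until it has crashed too often."""

    def __init__(self, crashes_before_redraw: int) -> None:
        self.crashes_before_redraw = crashes_before_redraw
        self.failures: Dict[int, int] = defaultdict(int)

    def on_failure(self, group_id: int) -> str:
        self.failures[group_id] += 1
        if self.failures[group_id] >= self.crashes_before_redraw:
            # Assumed behavior: the parameters are suspected to be the cause,
            # so new ones are drawn and the count starts over.
            self.failures[group_id] = 0
            return "redraw"
        return "resubmit"


policy = RetryPolicy(crashes_before_redraw=2)
decisions = [policy.on_failure(7) for _ in range(3)]
```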
close_connection(exit_=0)
¶
Signals to the launcher that the study has ended with a specified exit status.
Parameters¶
- exit_ (int, optional): The exit status code to be sent to the launcher. Defaults to 0, indicating successful completion.
get_memory_info_in_gb()
¶
Returns a Tuple[float, float] containing memory consumed and
the total main memory in GB.
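For readers who want to reproduce a similar (consumed, total) pair, here is a Linux-only sketch based on os.sysconf. It assumes "consumed" means total minus free physical pages and converts with 1024**3 bytes per GB; both are assumptions, and the server method may measure memory differently.

```python
import os
from typing import Tuple


def memory_info_in_gb() -> Tuple[float, float]:
    """Hypothetical helper: (used, total) physical memory, Linux-only."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    total_bytes = page_size * os.sysconf("SC_PHYS_PAGES")
    # SC_AVPHYS_PAGES counts free pages only, so cached memory is
    # reported as used by this sketch.
    free_bytes = page_size * os.sysconf("SC_AVPHYS_PAGES")
    bytes_per_gb = 1024 ** 3
    return (total_bytes - free_bytes) / bytes_per_gb, total_bytes / bytes_per_gb


used_gb, total_gb = memory_info_in_gb()
```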
setup_environment()
¶
Optional. A method that sets up the environment or initialization.
Any necessary setup methods go here.
For example, a Melissa DL study needs dist.init_process_group to be called.
__deserialize_message(msg)
¶
__process_simulation_completion(simulation, force=False)
¶
Finalizes simulation completion and adjusts metadata associated with it.
Parameters¶
- simulation (Simulation): Instance of the simulation to finalize.
- force (bool): Set to enforce termination, regardless. Default is False.
__determine_and_process_simulation_data(simulation_data)
¶
start()
abstractmethod
¶
The high-level organization of server events. Unique to each Melissa flavor.