Monitoring Melissa and Analyzing Results¶
This tutorial explains how to effectively monitor a Melissa study during and after its execution.
Melissa Logs¶
Melissa generates separate log files for each component (e.g., launcher, server, clients). These logs are written to the study's output directory:
- melissa_launcher.log: This log contains information from the launcher. It notifies the user of any client/server job failures and whether the study was successful. The verbosity level of this log can be adjusted using the verbosity option in the launcher_config section of the configuration file.
- melissa_server_<rank>.log: A separate log file is created for each rank of the server. These logs capture any issues during the processing of client data. The verbosity level for these logs is controlled by the verbosity option in the study_options section of the configuration file.
- Client Logs: Client logs correspond to the standard output (*.out) and error (*.err) files of the client jobs. The log filenames vary based on the scheduler and their unique uid.
Note
All standard output and error files are stored in the stdout/ folder located within the study's output directory.
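As a rough illustration of this layout, the sketch below gathers a study's log files by component. It relies only on the file names listed in this section; the helper name collect_logs and the STUDY_OUT path are hypothetical and not part of Melissa.

```python
from pathlib import Path


def collect_logs(output_dir):
    """Group the log files found in a study's output directory.

    Relies only on the naming scheme described in this section;
    collect_logs is a hypothetical helper, not part of Melissa.
    """
    root = Path(output_dir)
    return {
        "launcher": sorted(root.glob("melissa_launcher.log")),
        "server": sorted(root.glob("melissa_server_*.log")),
        # Standard output and error files live in the stdout/ sub-folder.
        "stdout": sorted((root / "stdout").glob("*.out")),
        "stderr": sorted((root / "stdout").glob("*.err")),
    }


if __name__ == "__main__":
    # "STUDY_OUT" is a placeholder for the study's output directory.
    for component, paths in collect_logs("STUDY_OUT").items():
        print(f"{component}: {len(paths)} file(s)")
```

Non-empty *.err files are usually a reasonable first place to look when the launcher log reports a failed job.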
Using the melissa-monitor Command¶
The melissa-launcher includes a REST API (described in the Using the Melissa REST API section), and melissa-monitor uses this API to display job statuses as plots directly in the cluster terminal.
Prerequisites¶
Before using melissa-monitor, ensure that the required dependency is installed:
How to use¶
- Start melissa-launcher as usual.
- Once launched, you will see a header printed in the terminal, which looks like this:
>>> melissa-launcher --config config_mpi.json
+----------------------------------------------------------+
| |
| ______ _____________________________________________ |
| ___ |/ /__ ____/__ /____ _/_ ___/_ ___/__ | |
| __ /|_/ /__ __/ __ / __ / _____ \_____ \__ /| | |
| _ / / / _ /___ _ /____/ / ____/ /____/ /_ ___ | |
| /_/ /_/ /_____/ /_____/___/ /____/ /____/ /_/ |_| |
| |
+----------------------------------------------------------+
Current outputs directory: /home/abhishek/Projects/REFACTORING/examples/heat-pde/heat-pde-dl/STUDY_OUT
Monitor melissa runs using the following command in a different terminal:
melissa-monitor --http_bind=0.0.0.0 \
--http_port=8888 \
--http_token=t17WX2ONhltPhPtieCEqDw \
--output_dir=/home/abhishek/Projects/REFACTORING/examples/heat-pde/heat-pde-dl/STUDY_OUT
Follow the printed instructions: open a new terminal, source melissa_set_env.sh, and paste the recommended command. This starts continuous job status tracking with output that looks like this:
The default settings for melissa-monitor can be configured in the launcher_config section of the configuration JSON file. The available controls include:
- bind: The network address to bind to (default: 0.0.0.0).
- http_port: The port used by the REST API (default: 8888).
- http_token: A secure token for authentication (default: an automatically generated unique 16-character token).
"launcher_config": {
...
"bind": "0.0.0.0",
"http_port": 8888,
"http_token": "I-2pqnkgVNfdR3U-wUiUbw",
...
}
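To show how these settings relate to the command printed by melissa-launcher, here is a minimal sketch that assembles a melissa-monitor command line from a parsed configuration. The flag names are taken from the header shown earlier; the function monitor_command and the STUDY_OUT path are hypothetical.

```python
import json
import shlex


def monitor_command(config, output_dir):
    """Build a melissa-monitor command line from a parsed configuration.

    The flag names are the ones printed by melissa-launcher at start-up.
    monitor_command itself is a hypothetical helper, not part of Melissa.
    """
    launcher = config.get("launcher_config", {})
    args = [
        "melissa-monitor",
        f"--http_bind={launcher.get('bind', '0.0.0.0')}",
        f"--http_port={launcher.get('http_port', 8888)}",
    ]
    # Without an explicit http_token the launcher generates one at start-up,
    # so it can only be filled in here when the configuration sets it.
    if "http_token" in launcher:
        args.append(f"--http_token={launcher['http_token']}")
    args.append(f"--output_dir={output_dir}")
    return args


if __name__ == "__main__":
    example = json.loads(
        '{"launcher_config": {"bind": "0.0.0.0", "http_port": 8888,'
        ' "http_token": "I-2pqnkgVNfdR3U-wUiUbw"}}'
    )
    # "STUDY_OUT" is a placeholder for the study's output directory.
    print(shlex.join(monitor_command(example, "STUDY_OUT")))
```

When the token is left to the launcher, the command printed in the launcher header remains the reliable source for it.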
Warning
If users are running multiple studies on the same node, they must manually assign a unique port number for each study. Failing to do so will result in a study failure due to port conflicts.
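One way to pick a distinct port before launching a second study is to probe for a free one. The sketch below uses only Python's standard socket module; the helper names port_is_free and first_free_port are hypothetical, and the result still has to be written to http_port by hand.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def first_free_port(start=8888, attempts=100, host="0.0.0.0"):
    """Return the first free port at or after start, or None if none is found."""
    # Stay within the valid TCP port range.
    for port in range(start, min(start + attempts, 65536)):
        if port_is_free(port, host):
            return port
    return None


if __name__ == "__main__":
    print(first_free_port())
```

The check is only a snapshot: another process may take the port between the probe and the launcher start, so this narrows down candidates rather than reserving a port.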
Note
If users are running on an EAR-enabled cluster, they can activate real-time power monitoring (via eacct) with the --report_eacct_metrics flag.
Using the Melissa REST API¶
The melissa-launcher provides a REST API for tracking job statuses. By default, it listens on 0.0.0.0 (all network interfaces, so it accepts external connections) on port 8888.
Configuring the Host and Port¶
- The host URL can be customized using the bind parameter in the launcher_config section of the configuration file.
- Similarly, the port number can be changed by modifying the http_port parameter in the same section.
Available Endpoints¶
The REST API supports the following endpoints:
- /jobs: Returns a list of all job numbers known to the launcher.
- /jobs/<job_number>: Returns the status of a specific job (<job_number>). The possible job statuses are:
  - WAITING
  - RUNNING
  - TERMINATED
  - ERROR
Example Usage¶
An example of how to use the REST API with the Python requests library can be found in melissa/launcher/monitoring/terminal_monitor.py.
The following commands demonstrate how to retrieve job information:
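As a hedged illustration, the sketch below queries the two endpoints using Python's standard urllib module rather than requests, so that it has no third-party dependency. Two points are assumptions, not confirmed by this page: the token is sent in an HTTP header named "token", and the responses are JSON. The functions build_url and get_json are hypothetical; terminal_monitor.py remains the reference for the actual scheme.

```python
import json
import urllib.request


def build_url(host, port, path):
    """Compose the URL of a REST endpoint such as /jobs or /jobs/<n>."""
    return f"http://{host}:{port}/{path.lstrip('/')}"


def get_json(url, token, timeout=5.0):
    """GET a URL and decode the response body as JSON.

    ASSUMPTION: the token travels in a header named "token" and the body
    is JSON. Check terminal_monitor.py for the scheme Melissa really uses.
    """
    request = urllib.request.Request(url, headers={"token": token})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Placeholder values: use the address, port, and token printed by
    # melissa-launcher for the running study.
    host, port, token = "127.0.0.1", 8888, "<http_token>"
    try:
        print(get_json(build_url(host, port, "jobs"), token))
        # Replace 0 with a job number returned by the previous call.
        print(get_json(build_url(host, port, "jobs/0"), token))
    except (OSError, ValueError) as exc:
        # Raised when no launcher is listening or the reply is not JSON.
        print(f"request failed: {exc}")
```

The response bodies are printed as received because their exact structure is not specified here.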