Monitoring Melissa and Analyzing Results

This tutorial explains how to effectively monitor a Melissa study during and after its execution.

Melissa Logs

Melissa generates separate log files for each component (e.g., launcher, server, clients). These logs are written to the study's output directory:

  • melissa_launcher.log: This log contains information from the launcher. It notifies the user of any client/server job failures and whether the study was successful. The verbosity level of this log can be adjusted using the verbosity option in the launcher_config section of the configuration file.

  • melissa_server_<rank>.log: A separate log file is created for each rank of the server. These logs capture any issues during the processing of client data. The verbosity level for these logs is controlled by the verbosity option in the study_options section of the configuration file.

  • Client Logs: Client logs correspond to the standard output (*.out) and error (*.err) files of the client jobs. The filenames depend on the scheduler and on each job's unique ID.

Note

All standard output and error files are stored in the stdout/ folder located within the study's output directory.
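As a rough illustration of the layout described above, the following sketch groups a study's log files by component using only the file names and the stdout/ folder mentioned in this section. The helper names collect_logs and non_empty are hypothetical and not part of Melissa; the output directory name STUDY_OUT is taken from the example later in this tutorial.

```python
from pathlib import Path


def collect_logs(output_dir):
    """Group the log files found in a study's output directory."""
    out = Path(output_dir)
    return {
        "launcher": sorted(out.glob("melissa_launcher.log")),
        # One file per server rank: melissa_server_<rank>.log
        "server": sorted(out.glob("melissa_server_*.log")),
        # File names in stdout/ depend on the scheduler, so match by extension only.
        "stdout": sorted((out / "stdout").glob("*.out")),
        "stderr": sorted((out / "stdout").glob("*.err")),
    }


def non_empty(paths):
    """Keep only the files that contain at least one byte."""
    return [p for p in paths if p.stat().st_size > 0]


if __name__ == "__main__":
    logs = collect_logs("STUDY_OUT")
    # Non-empty *.err files are usually the first place to look after a failure.
    for path in non_empty(logs["stderr"]):
        print(path)
```

Listing the non-empty *.err files first is only one possible triage order; the launcher log remains the authoritative source for job failures.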

Using the melissa-monitor Command

The melissa-launcher exposes a REST API (see Using the Melissa REST API), and melissa-monitor uses this API to plot job statuses directly in the cluster terminal.

Prerequisites

Before using melissa-monitor, ensure that the required dependency is installed:

pip3 install plotext

How to use

  • Start melissa-launcher as usual
  • Once launched, you will see a header printed in the terminal, which looks like this:
>>> melissa-launcher --config config_mpi.json 


    +----------------------------------------------------------+
    |                                                          |
    |  ______  _____________________________________________   |
    |  ___   |/  /__  ____/__  /____  _/_  ___/_  ___/__    |  |
    |  __  /|_/ /__  __/  __  /  __  / _____ \_____ \__  /| |  |
    |  _  /  / / _  /___  _  /____/ /  ____/ /____/ /_  ___ |  |
    |  /_/  /_/  /_____/  /_____/___/  /____/ /____/ /_/  |_|  |
    |                                                          |
    +----------------------------------------------------------+



Current outputs directory: /home/abhishek/Projects/REFACTORING/examples/heat-pde/heat-pde-dl/STUDY_OUT

Monitor melissa runs using the following command in a different terminal:

melissa-monitor --http_bind=0.0.0.0 \
        --http_port=8888 \
        --http_token=t17WX2ONhltPhPtieCEqDw \
        --output_dir=/home/abhishek/Projects/REFACTORING/examples/heat-pde/heat-pde-dl/STUDY_OUT

To follow these instructions, open a new terminal, source melissa_set_env.sh, and paste the suggested command. This starts continuous job status tracking, displayed as plots in the terminal:

[image: example melissa-monitor output]

The default settings for melissa-monitor can be configured in the launcher_config section of the configuration JSON file. The available controls include:

  • bind: The network address to bind to (default: 0.0.0.0).
  • http_port: The port used by the REST API (default: 8888).
  • http_token: A secure token for authentication (default: an automatically generated unique token; the sample tokens in this document are 22 characters long, consistent with 16 random bytes in URL-safe Base64).
    "launcher_config": {
        ...
        "bind": "0.0.0.0",
        "http_port": 8888,
        "http_token": "I-2pqnkgVNfdR3U-wUiUbw",
        ...
    }
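The relationship between these settings and the melissa-monitor command printed by the launcher can be sketched as follows. This is a minimal illustration, not Melissa code: monitor_command is a hypothetical helper, the flags are the ones shown in the launcher header above, and the sketch assumes http_token is set explicitly in the configuration (otherwise the launcher generates one and the printed command should be used instead).

```python
import shlex


def monitor_command(launcher_config, output_dir):
    """Build a melissa-monitor argument list from launcher_config values."""
    bind = launcher_config.get("bind", "0.0.0.0")      # documented default
    port = launcher_config.get("http_port", 8888)      # documented default
    # No fallback here: the default token is random, so it cannot be guessed.
    token = launcher_config["http_token"]
    return [
        "melissa-monitor",
        f"--http_bind={bind}",
        f"--http_port={port}",
        f"--http_token={token}",
        f"--output_dir={output_dir}",
    ]


if __name__ == "__main__":
    config = {"http_port": 8888, "http_token": "I-2pqnkgVNfdR3U-wUiUbw"}
    # shlex.join produces a string that is safe to paste into a shell.
    print(shlex.join(monitor_command(config, "STUDY_OUT")))
```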

Warning

If users are running multiple studies on the same node, they must manually assign a unique port number for each study. Failing to do so will result in a study failure due to port conflicts.
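One way to pick a non-conflicting http_port before launching a second study is to try binding to the candidate port. The sketch below uses only the Python standard library; is_port_free and first_free_port are hypothetical helpers, not Melissa functions. Note that the check is inherently racy: another process may take the port between the check and the launcher start.

```python
import socket


def is_port_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use" or a permission error.
            return False
    return True


def first_free_port(start=8888, attempts=100):
    """Scan upward from `start` and return the first port that can be bound."""
    for port in range(start, start + attempts):
        if is_port_free(port):
            return port
    raise RuntimeError(f"no free port in [{start}, {start + attempts})")


if __name__ == "__main__":
    # The printed value could then be written to http_port in launcher_config.
    print(first_free_port())
```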

Note

If users are running on an EAR enabled cluster, they can activate real-time power monitoring (via eacct) with the --report_eacct_metrics flag.

Using the Melissa REST API

The melissa-launcher provides a REST API for tracking job statuses. By default, it binds to 0.0.0.0 (all network interfaces, so it is also reachable from other hosts) on port 8888.

Configuring the Host and Port

  • The bind address can be customized using the bind parameter in the launcher_config section of the configuration file.
  • Similarly, the port number can be changed by modifying the http_port parameter in the same section.

Available Endpoints

The REST API supports the following endpoints:

  1. /jobs: Returns a list of all job numbers known to the launcher.

  2. /jobs/<job_number>: Returns the status of a specific job (<job_number>). The possible job statuses include:

    • WAITING
    • RUNNING
    • TERMINATED
    • ERROR

Example Usage

An example of how to use the REST API with the Python requests library can be found in:
melissa/launcher/monitoring/terminal_monitor.py.

The following commands demonstrate how to retrieve job information:

import requests

# the token must match http_token in launcher_config
headers = {'token': 'I-2pqnkgVNfdR3U-wUiUbw'}

# get a list of all jobs
response = requests.get('http://127.0.0.1:8888/jobs', headers=headers).json()

# response
# {'jobs': [0, 1, 2, 3, 4]}

# get the status of a specific job
job_dict = requests.get('http://127.0.0.1:8888/jobs/1', headers=headers).json()

# job_dict
# {
#   'id': 9995,  # scheduler job ID (an OAR ID in this example)
#   'unique_id': 1,
#   'state': 'RUNNING'
# }
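Building on the two calls above, a simple polling loop could look like the sketch below. It is an illustration under stated assumptions rather than the approach used by melissa-monitor: the response shapes ({'jobs': [...]} and a 'state' field) are taken from the example output above, the helper names are hypothetical, and the standard-library urllib is used in place of requests so the snippet has no third-party dependency. Because the launcher may submit further jobs later, "all known jobs are final" does not by itself mean the study has finished.

```python
import json
import time
import urllib.request

# States after which a job is assumed not to change any more.
FINAL_STATES = {"TERMINATED", "ERROR"}


def make_fetch(base_url, token):
    """Return a function that GETs base_url + path and decodes the JSON body."""
    def fetch(path):
        request = urllib.request.Request(base_url + path, headers={"token": token})
        with urllib.request.urlopen(request, timeout=10) as response:
            return json.load(response)
    return fetch


def job_states(fetch):
    """Map every job number returned by /jobs to its current state."""
    return {n: fetch(f"/jobs/{n}")["state"] for n in fetch("/jobs")["jobs"]}


def wait_for_jobs(fetch, interval=5.0, max_polls=None):
    """Poll until all known jobs are in a final state or max_polls is reached."""
    polls = 0
    while True:
        states = job_states(fetch)
        polls += 1
        if states and all(s in FINAL_STATES for s in states.values()):
            return states
        if max_polls is not None and polls >= max_polls:
            return states
        time.sleep(interval)


if __name__ == "__main__":
    fetch = make_fetch("http://127.0.0.1:8888", "I-2pqnkgVNfdR3U-wUiUbw")
    print(wait_for_jobs(fetch, interval=5.0, max_polls=3))
```

Passing the fetch function as an argument keeps the polling logic independent of the HTTP client, which also makes it easy to exercise with a stub.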