Monitoring Melissa and analyzing your results

This tutorial explains how to efficiently monitor a Melissa study during and after execution.

Melissa Logs

In the current version of Melissa, each component (i.e. launcher, server, clients) has its own log file(s) written in the output directory of the study:

  • melissa_launcher.log - this file contains all information written by the launcher. It makes the user aware of any client/server job failure and whether the whole study was successful or not. The level of verbosity of this log is set with the verbosity option in the launcher_config section of the configuration file.
  • melissa_server_<rank>.log - this file is generated for each rank of the server. It makes the user aware of any problem occurring in the processing of the client data. The level of verbosity of these logs is set with the verbosity option in the study_options section of the configuration file.
  • client logs - these correspond to the standard output/error of the client jobs. Their names depend on the scheduler and their uid (see Launching and debugging a use-case). In addition to any error messages produced by the client execution itself, the error output contains information printed by the Melissa API, while the standard output contains information produced by the data generator.

Note

All standard output/error files are saved in the stdout folder located inside the output directory of the study.

In addition to these files, the deep-learning server writes all training-related information to a Tensorboard log, which is discussed at the end of this page.
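
A quick way to triage a finished or failing study is to scan these log files for error records. The snippet below is a minimal sketch, assuming the level name (e.g. ERROR) appears in each log record; the output directory path is taken from the Lorenz example used later on this page and should be adapted to your own study:

from pathlib import Path

# Directory of the study (the Lorenz example path is used here for illustration).
output_dir = Path("examples/lorenz/STUDY_OUT")

# Scan the launcher and server logs for lines flagged as errors.
for log_file in sorted(output_dir.glob("melissa_*.log")):
    errors = [line for line in log_file.read_text().splitlines() if "ERROR" in line]
    if errors:
        print(f"{log_file.name}: {len(errors)} error line(s)")
        print("\n".join(errors[:5]))  # show the first few occurrences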

Job monitoring on a cluster

Batch schedulers provide commands to monitor job progress. With OAR, there is a convenient command, oarstat -u, and the user can monitor the status of all their jobs via:

watch -n 5 oarstat -u

On Slurm, the user can similarly use squeue --user=<username>:

watch -n 5 squeue --user=<username>

Note

The -n flag specifies the refresh interval in seconds. A reasonable value helps keep the frontend from becoming saturated so that other users can use it without performance degradation.

Warning

It is sometimes recommended to use a refresh interval on the order of one minute.

Using the melissa-monitor command

melissa-launcher also provides a REST API (details here). melissa-monitor builds on this REST API to automatically plot job statuses in the cluster terminal. To use this module, the user should first install its dependency with pip3 install plotext. The user can then start melissa-launcher as usual and will notice a header printed to the terminal that looks like this:

user@flyon:~/melissa/melissa$ melissa-launcher --config_name config_oar_cpu

$! -------------------------------------------------$!
  __  __ ______ _      _____  _____ _____          
 |  \/  |  ____| |    |_   _|/ ____/ ____|  /\      
 | \  / | |__  | |      | | | (___| (___   /  \     
 | |\/| |  __| | |      | |  \___ \\___ \ / /\ \   
 | |  | | |____| |____ _| |_ ____) |___) / ____ \  
 |_|  |_|______|______|_____|_____/_____/_/    \_\

$! -------------------------------------------------$!

Access the terminal-based Melissa monitor by opening a new terminal and executing:

melissa-monitor --http_bind=0.0.0.0 --http_port=8888 --http_token=I-2pqnkgVNfdR3U-wUiUbw --output_dir=/home/user/melissa/melissa/examples/lorenz/STUDY_OUT 

The user can follow the provided instructions by opening a new terminal, sourcing melissa_set_env.sh, and pasting the recommended command. This starts continuous job status tracking with output that looks like this:

[Image: melissa-monitor terminal output showing the status of each job]

These default settings can be overridden in the launcher_config section of the JSON configuration file. Available controls include bind (default 0.0.0.0), http_port (default 8888), and http_token (defaults to an automatically generated, unique 16-character safe token):

    "launcher_config": {
        "scheduler": "oar",
        "num_server_processes": 1,
        "num_client_processes": 1,
        "bind": "0.0.0.0",
        "http_port": 8888,
        "http_token": "I-2pqnkgVNfdR3U-wUiUbw",
        "scheduler_arg_server": [
            "walltime=01:00:00"
        ],
        "fault_tolerance": false,
        "client_executable": "/path/to/melissa/examples/lorenz/lorenz.py",
        "verbosity": 1
    }
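
If users prefer to set http_token themselves rather than rely on the auto-generated value, a token of comparable shape can be produced with Python's standard secrets module (a minimal sketch; the exact scheme Melissa uses internally to generate its token may differ):

import secrets

# Generate a URL-safe random token from 16 bytes of entropy
# (yields a ~22-character string similar in shape to the example above).
token = secrets.token_urlsafe(16)
print(token)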

Note

If users are running on an EAR-enabled cluster, they can activate real-time power monitoring (via eacct) with the --report_eacct_metrics flag.

Tensorboard logging for Deep Learning

Users are encouraged to use the built-in Tensorboard logging feature, which is designed to make monitoring and post-processing deep-learning studies easier. As exemplified in examples/heat-pde/heatpde_server.py, the Tensorboard logger is available anywhere in the custom server class through the method self.tb_logger.log_scalar. For example, the loss during training can be logged by embedding the following line inside the training loop:

self.tb_logger.log_scalar("Loss/train", loss.item(), batch)
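
More generally, any scalar can be logged from anywhere in the custom server. The fragment below is a sketch of such a training loop; only self.tb_logger.log_scalar is the Melissa call described above, while the model, optimizer, criterion and batch loader are hypothetical placeholders:

# Hypothetical training-loop fragment inside a custom server class.
for batch, (x, y) in enumerate(batch_loader):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

    # Log the training loss and the current learning rate against the batch index.
    self.tb_logger.log_scalar("Loss/train", loss.item(), batch)
    self.tb_logger.log_scalar("LR/train", optimizer.param_groups[0]["lr"], batch)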

Tensorboard makes it easy to monitor these values in real-time by opening another terminal and executing:

tensorboard --logdir melissa/examples/heat-pde/output_dir/tensorboard

This starts a server on localhost at port 6006. Typing http://localhost:6006 into a browser then opens the Tensorboard dashboard, where users can monitor the training progress as well as any additional data they choose to log with self.tb_logger.log_scalar:

[Image: Tensorboard dashboard showing the logged training scalars]

Melissa makes use of the tensorboard logger for a variety of other metrics including:

  • samples_per_second: Average number of samples trained per second (logged at the n_batch_update frequency).

  • buffer_size: Size of the buffer at given time (logged on each sample get).

  • put_time: Time spent to put each sample into the buffer.

  • get_time: Time spent to get each sample from the buffer.

  • put_get_inc: Metric aimed at showing balance of puts and gets (puts add 1, gets subtract 1).

  • buffer_std/{param}: The standard deviation of {param} in the buffer (only active if get_buffer_statistics is set to true in dl_config). This value also requires customization for custom parameters (see examples/heat-pde/heat-pde-dl/heatpde_dl_server.py for an example).

  • buffer_mean/{param}: The mean of {param} in the buffer (only active if get_buffer_statistics is set to true in dl_config). This value also requires customization for custom parameters (see examples/heat-pde/heat-pde-dl/heatpde_dl_server.py for an example).

Deeper post-processing

Users have the option of automatically generating a pandas dataframe from the Tensorboard logs via the configuration flag convert_log_to_df. By default, this flag is set to false; when it is activated, a dataframe pickle is saved to the output_dir/tensorboard directory. The dataframe contains all information logged by the function self.tb_logger.log_scalar() (including all additional metrics listed in Tensorboard logging).

The following is an example dl_config for users who wish to generate a dataframe from their Tensorboard logs:

    "dl_config": {
        ...
        "convert_log_to_df": true
    },

Note

This feature requires an additional installation of pandas and tensorflow, which can both be installed via pip: pip install pandas tensorflow.

Users can then access each individual metric using pandas filtering mechanisms such as:

import pandas as pd
import matplotlib.pyplot as plt

# Load the dataframe pickle produced with convert_log_to_df.
df = pd.read_pickle("data.pkl")

# Select individual metrics by name.
puts = df[df.name == "put_time"]
gets = df[df.name == "get_time"]
buffer = df[df.name == "buffer_size"]

# Plot put/get times on the left axis and the buffer size on the right axis.
fig, ax = plt.subplots()
ax2 = ax.twinx()

ax.plot(puts["step"], puts["value"], alpha=0.5, label="puts", color="tab:blue")
ax.plot(gets["step"], gets["value"], alpha=0.5, label="gets", color="tab:orange")
ax2.plot(buffer["step"], buffer["value"], alpha=0.5, label="buffer size", color="tab:red")

ax.set_yscale("log")
ax.set_ylabel("Time (s)")
ax2.set_ylabel("Buffer size (N)")
ax.set_xlabel("Sample number")
ax.legend(loc="upper left")
ax2.legend(loc="upper right")
plt.savefig("puts_gets.png")
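
If convert_log_to_df was not enabled for a run, a comparable dataframe can still be assembled directly from the Tensorboard event files with the tensorboard package. The following is a minimal sketch; the name/step/value column layout is chosen here to match the filtering above and is not guaranteed to be identical to the pickle Melissa writes:

import pandas as pd
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# Load every scalar event from the study's tensorboard directory
# (size_guidance=0 disables the default subsampling of scalar events).
acc = EventAccumulator("output_dir/tensorboard", size_guidance={"scalars": 0})
acc.Reload()

# Flatten all logged scalars into one long-format dataframe.
rows = [
    {"name": tag, "step": event.step, "value": event.value}
    for tag in acc.Tags()["scalars"]
    for event in acc.Scalars(tag)
]
df = pd.DataFrame(rows)
df.to_pickle("data.pkl")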

Using the Melissa REST API

The melissa-launcher includes a REST API for tracking job statuses. By default, the base URL is the local host (accepting external connections) on port 8888:

'http://0.0.0.0:8888/'

The host address can be controlled in the launcher_config with the bind parameter, and the port number can be changed with the http_port parameter.

melissa-launcher supports two endpoints: /jobs and /jobs/<job_number>. The former returns the list of all job numbers known to the launcher, while the latter returns the status of the specific job <job_number>, which may be WAITING, RUNNING, TERMINATED, or ERROR.

An example of how to use the REST API with the Python requests library is provided in melissa/launcher/monitoring/terminal_monitor.py (details of script usage shown here). Namely, the job information can be retrieved with the following two requests:

import requests

# Get the list of all job numbers known to the launcher.
response = requests.get('http://127.0.0.1:8888/jobs', headers={'token': 'I-2pqnkgVNfdR3U-wUiUbw'}).json()

# response
# {'jobs': [0,1,2,3,4]}

# Get the status of job number 1.
job_dict = requests.get('http://127.0.0.1:8888/jobs/1', headers={'token': 'I-2pqnkgVNfdR3U-wUiUbw'}).json()

# job_dict
# {
#   'id': 9995, # oarid
#   'unique_id': 1,
#   'state': 'RUNNING'
# }
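
Building on these two endpoints, a small polling loop can reproduce a basic version of what melissa-monitor does. The sketch below assumes the same host, port and token as above and simply prints a count of job states every few seconds:

import time
from collections import Counter

import requests

BASE_URL = "http://127.0.0.1:8888"
HEADERS = {"token": "I-2pqnkgVNfdR3U-wUiUbw"}

while True:
    # Fetch the list of known job numbers, then the state of each job.
    job_ids = requests.get(f"{BASE_URL}/jobs", headers=HEADERS).json()["jobs"]
    states = [
        requests.get(f"{BASE_URL}/jobs/{job_id}", headers=HEADERS).json()["state"]
        for job_id in job_ids
    ]

    # Print a summary such as: {'RUNNING': 3, 'WAITING': 2}
    print(dict(Counter(states)))
    time.sleep(5)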