Metric Logging for Deep Learning

Logging within a server class

Melissa provides built-in TensorBoard and W&B logging to make it easier to monitor and post-process deep-learning studies, and users are encouraged to use it.

As exemplified in examples/heat-pde/heatpde_server.py, the logger is available anywhere in the custom server class through the self.metric_logger.log_* methods. The following methods are available:

self.metric_logger.log_scalar("Loss/train", batch_loss, batch_idx)
self.metric_logger.log_scalars("Metrics", metrics_dict, batch_idx)
self.metric_logger.log_figure("Plots/metric", metric_plot_fig, batch_idx)
self.metric_logger.log_histogram("Histograms/dist", dist, batch_idx)
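
The tag/value/step pattern of these calls can be tried outside Melissa with a small stand-in. The sketch below is not Melissa's logger: RecordingLogger and train_one_epoch are hypothetical names that only mimic the four method signatures listed above and store each call in a list instead of writing to TensorBoard or W&B.

```python
from typing import Any, Dict, List, Tuple


class RecordingLogger:
    """Hypothetical stand-in that mimics the log_* signatures shown above."""

    def __init__(self) -> None:
        # Each entry is (kind, tag, value, step).
        self.records: List[Tuple[str, str, Any, int]] = []

    def log_scalar(self, tag: str, value: float, step: int) -> None:
        self.records.append(("scalar", tag, value, step))

    def log_scalars(self, tag: str, values: Dict[str, float], step: int) -> None:
        self.records.append(("scalars", tag, dict(values), step))

    def log_figure(self, tag: str, figure: Any, step: int) -> None:
        self.records.append(("figure", tag, figure, step))

    def log_histogram(self, tag: str, values: Any, step: int) -> None:
        self.records.append(("histogram", tag, list(values), step))


def train_one_epoch(logger: RecordingLogger, losses: List[float]) -> None:
    """Log one scalar per batch, plus a min/max summary every second batch."""
    for batch_idx, batch_loss in enumerate(losses):
        logger.log_scalar("Loss/train", batch_loss, batch_idx)
        if batch_idx % 2 == 1:
            seen = losses[: batch_idx + 1]
            summary = {"min": min(seen), "max": max(seen)}
            logger.log_scalars("Metrics", summary, batch_idx)
    # One histogram of all batch losses, logged at the last batch index.
    logger.log_histogram("Histograms/dist", losses, len(losses) - 1)


logger = RecordingLogger()
train_one_epoch(logger, [0.9, 0.7, 0.6, 0.4])
```

In a real server class the same calls would go through self.metric_logger, with the batch index serving as the step so that curves from different metrics line up on the same axis.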

TensorBoard Logging

By default, the Melissa server initializes a TensorBoard logger instance.

Note

If users want more flexibility, they can access the underlying SummaryWriter object through the self.metric_logger.writer attribute.
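
As one possible use of the writer attribute, the sketch below writes a free-text note, which the log_* methods listed earlier do not cover. It assumes the writer behaves like torch.utils.tensorboard.SummaryWriter, whose add_text method takes a tag, a text string and a global step; the helper name log_run_note and the tag "Notes/run" are hypothetical.

```python
from typing import Any


def log_run_note(metric_logger: Any, note: str, step: int = 0) -> None:
    """Write a free-text note through the underlying writer.

    Assumes metric_logger.writer behaves like
    torch.utils.tensorboard.SummaryWriter, whose add_text method
    takes (tag, text_string, global_step).
    """
    writer = getattr(metric_logger, "writer", None)
    if writer is None:
        # No writer available (e.g. a different logging backend); skip quietly.
        return
    writer.add_text("Notes/run", note, global_step=step)
```

Inside a server class this would be called as log_run_note(self.metric_logger, "learning rate 1e-3", 0).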

TensorBoard allows you to monitor these values in real-time. To start, open a new terminal and run:

tensorboard --logdir melissa/examples/heat-pde/output_dir/tensorboard

By default, this launches a server at http://localhost:6006. Open this address in a browser to follow the training progress on the TensorBoard dashboard.

Melissa logs a variety of metrics, including:

| Metric | Description | Scope |
| --- | --- | --- |
| BufferStatistics/occupancy | Percentage of the buffer currently in use (current_size/maxsize). | Local to MPI rank |
| BufferStatistics/put_rate_items_per_sec | Throughput of incoming simulation data. | Local to MPI rank |
| BufferStatistics/get_rate_items_per_sec | Throughput of data consumed by the trainer. | Local to MPI rank |
| BufferStatistics/mean_seen | Average number of times a sample is reused before being replaced. | Local to MPI rank |
| BufferStatistics/max_seen | The highest reuse count for any single sample currently in the buffer. | Local to MPI rank |
| BufferStatistics/final_seen_distribution | Histogram of sample-seen frequencies (logged once at the end of training for Reservoir-like buffers). | Local to MPI rank |
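
To make some of these definitions concrete, the sketch below computes toy versions of occupancy, mean_seen and max_seen for a small buffer. It illustrates the descriptions only and is not Melissa's implementation: buffer_stats and its arguments are hypothetical, and mean_seen is simplified to an average over the samples currently held rather than over samples at the moment they are replaced.

```python
from typing import Dict, List


def buffer_stats(seen_counts: List[int], maxsize: int) -> Dict[str, float]:
    """Compute toy versions of three BufferStatistics metrics.

    seen_counts holds one entry per sample currently in the buffer:
    the number of times that sample has been drawn so far.
    """
    if maxsize <= 0:
        raise ValueError("maxsize must be positive")
    current_size = len(seen_counts)
    return {
        # Percentage of the buffer in use.
        "occupancy": 100.0 * current_size / maxsize,
        # Simplification: average reuse over samples currently held.
        "mean_seen": sum(seen_counts) / current_size if current_size else 0.0,
        # Highest reuse count among samples currently held.
        "max_seen": float(max(seen_counts, default=0)),
    }


stats = buffer_stats([1, 3, 2, 2], maxsize=8)
print(stats)  # {'occupancy': 50.0, 'mean_seen': 2.0, 'max_seen': 3.0}
```

Because the metrics are local to each MPI rank, a computation of this kind would run independently on every rank's own buffer.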

Additionally, the log_buffer_mean_std() method implemented in examples/heat-pde/heat-pde-dl/heatpde_dl_server.py records:

| Metric | Description |
| --- | --- |
| BufferStatistics/std/{param} | Standard deviation of {param} in the buffer |
| BufferStatistics/mean/{param} | Mean of {param} in the buffer |
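
A per-parameter logger of this kind could be sketched as follows. This is not the code from heatpde_dl_server.py: log_mean_std, the list-of-dicts sample layout and the use of the population standard deviation are all assumptions made for illustration; only the tag names come from the table above.

```python
import statistics
from typing import Callable, Dict, List


def log_mean_std(
    samples: List[Dict[str, float]],
    log_scalar: Callable[[str, float, int], None],
    step: int,
) -> None:
    """Log the mean and standard deviation of each parameter in samples."""
    if not samples:
        return
    # Parameter names are taken from the first sample (assumed uniform).
    for param in samples[0]:
        values = [sample[param] for sample in samples]
        log_scalar(f"BufferStatistics/mean/{param}", statistics.fmean(values), step)
        # Population standard deviation; well defined even for a single sample.
        log_scalar(f"BufferStatistics/std/{param}", statistics.pstdev(values), step)
```

In a server class, self.metric_logger.log_scalar could be passed as the log_scalar argument, since it takes the same (tag, value, step) arguments.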

W&B Logging

Before enabling this logger, follow the W&B Quickstart guide.

As with the TensorBoard logger, the W&B logger is initialized by the server. To enable it, set the following options under dl_config:

{
    "dl_config": {
        "wandb": true,
        "wandb_project": "project_name",
        "wandb_group": "group_name"
    }
}
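
When configuration files are generated by a script, the same options can be merged into an existing configuration programmatically. The sketch below is one way to do that; enable_wandb is a hypothetical helper, and the batch_size entry only stands in for whatever dl_config already contains.

```python
import json
from typing import Any, Dict


def enable_wandb(config: Dict[str, Any], project: str, group: str) -> Dict[str, Any]:
    """Return a copy of config with the W&B options set under dl_config."""
    updated = dict(config)
    # Copy dl_config so the caller's dictionary is left untouched.
    dl_config = dict(updated.get("dl_config", {}))
    dl_config.update(
        {"wandb": True, "wandb_project": project, "wandb_group": group}
    )
    updated["dl_config"] = dl_config
    return updated


# batch_size is a placeholder for existing dl_config content.
base = {"dl_config": {"batch_size": 10}}
config = enable_wandb(base, "project_name", "group_name")
print(json.dumps(config, indent=4))
```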