TensorBoard logging for Deep Learning

To use the default TensorBoard logger, either the tensorflow or the torch module must be installed.

Warning

If you are using a different framework, it is preferred that you at least have tensorflow-cpu installed locally.

Logging within a server class

Users are encouraged to use the built-in TensorBoard logging feature, which makes it easier to monitor and post-process deep-learning studies.

As shown in examples/heat-pde/heatpde_server.py, the TensorBoard logger is available anywhere in the custom server class through the self.tb_logger.log_* methods. The following methods are available:

self.tb_logger.log_scalar("Loss/train", batch_loss, batch_idx)
self.tb_logger.log_scalars("Metrics", metrics_dict, batch_idx)
self.tb_logger.log_figure("Plots/metric", metric_plot_fig, batch_idx)
self.tb_logger.log_histogram("Histograms/dist", dist, batch_idx)
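
The call pattern above can be tried without Melissa installed. The following is a minimal sketch, assuming only the (tag, value, step) argument order shown in the calls above. RecordingLogger and train are hypothetical stand-ins written for this illustration, not Melissa classes, and the "main_tag/key" naming used for log_scalars is an assumption about how grouped values might be labelled.

```python
from typing import Any, Dict, List, Tuple


class RecordingLogger:
    """Hypothetical stand-in for self.tb_logger: stores calls instead of writing event files."""

    def __init__(self) -> None:
        self.records: List[Tuple[str, Any, int]] = []

    def log_scalar(self, tag: str, value: float, step: int) -> None:
        self.records.append((tag, value, step))

    def log_scalars(self, main_tag: str, values: Dict[str, float], step: int) -> None:
        # One record per dictionary entry, grouped under the main tag (assumed naming).
        for key, value in values.items():
            self.records.append((f"{main_tag}/{key}", value, step))


def train(logger: RecordingLogger, batches: List[List[float]]) -> None:
    """Toy loop: the 'loss' is the mean squared value of each batch."""
    for batch_idx, batch in enumerate(batches):
        batch_loss = sum(x * x for x in batch) / len(batch)
        logger.log_scalar("Loss/train", batch_loss, batch_idx)
        logger.log_scalars(
            "Metrics",
            {"min": min(batch), "max": max(batch)},
            batch_idx,
        )


logger = RecordingLogger()
train(logger, [[1.0, 3.0], [0.5, 0.5]])
for tag, value, step in logger.records:
    print(step, tag, value)
```

In a real server class, the same two calls would be made on self.tb_logger inside the training loop, with batch_idx as the step so that all curves share one x-axis.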

Note

Users who need more flexibility can access the underlying SummaryWriter object through the self.tb_logger.writer attribute.

TensorBoard lets you monitor these values in real time. To start it, open a new terminal and run:

tensorboard --logdir melissa/examples/heat-pde/output_dir/tensorboard

By default, this launches a server at http://localhost:6006. Open that address in a browser to follow the training progress on the TensorBoard dashboard.

Melissa itself also uses the TensorBoard logger to record several other metrics, including:

| Metric | Description | Scope |
| --- | --- | --- |
| samples_per_second | Average number of samples trained per second | Local to MPI rank |
| buffer_size | Size of the buffer at a given time | Local to MPI rank |
| put_time | Time spent to put each sample into the buffer | Local to MPI rank |
| get_time | Time spent to get each sample from the buffer | Local to MPI rank |

Additionally, the get_buffer_statistics method implemented in examples/heat-pde/heat-pde-dl/heatpde_dl_server.py records the following:

| Metric | Description |
| --- | --- |
| buffer_std/{param} | Standard deviation of {param} in the buffer |
| buffer_mean/{param} | Mean of {param} in the buffer |
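
To make the tag layout concrete, here is a minimal sketch of how per-parameter buffer statistics could be computed and named. It is not the get_buffer_statistics implementation from the example. The function buffer_statistics, the buffer layout (one tuple of input parameters per sample), and the parameter names t_init and t_bc are all assumptions made for this illustration, as is the choice of the population standard deviation.

```python
from statistics import fmean, pstdev
from typing import Dict, Sequence


def buffer_statistics(
    buffer: Sequence[Sequence[float]], param_names: Sequence[str]
) -> Dict[str, float]:
    """Return one buffer_mean/<param> and one buffer_std/<param> entry per parameter."""
    stats: Dict[str, float] = {}
    for i, name in enumerate(param_names):
        # Collect the i-th parameter of every sample currently in the buffer.
        column = [sample[i] for sample in buffer]
        stats[f"buffer_mean/{name}"] = fmean(column)
        # Population standard deviation; the real example may use another estimator.
        stats[f"buffer_std/{name}"] = pstdev(column)
    return stats


# Hypothetical buffer content: each entry holds the input parameters of one sample.
buffer = [(100.0, 2.0), (200.0, 4.0), (300.0, 6.0)]
stats = buffer_statistics(buffer, ["t_init", "t_bc"])
for tag, value in stats.items():
    print(tag, round(value, 3))
```

Each entry of the resulting dictionary could then be passed to self.tb_logger.log_scalar(tag, value, batch_idx). Note that this sketch raises a StatisticsError on an empty buffer, so a real implementation would need to guard against that case.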

Deeper post-processing

Users can automatically generate a pandas dataframe from the TensorBoard logs by setting the configuration flag convert_log_to_df, which is unset by default. The dataframe contains everything logged through the self.tb_logger.log_scalar* methods.

The following is an example dl_config for users who wish to generate a dataframe from their TensorBoard logs:

{
    "dl_config": {
        "convert_log_to_df": true
    }
}

Warning

This conversion requires pandas and tensorflow, both of which can be installed with pip install pandas tensorflow. They are included by default in the deep learning requirements.
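
Before launching a study, it can be handy to confirm that the flag is actually present in the configuration file. The snippet below is a small sketch using only the standard json module. The helper wants_dataframe is hypothetical and does not reflect how Melissa parses its configuration; it only assumes the dl_config / convert_log_to_df layout shown above and treats a missing key as false.

```python
import json

# Same structure as the dl_config example, written as strict JSON.
config_text = """
{
    "dl_config": {
        "convert_log_to_df": true
    }
}
"""


def wants_dataframe(config: dict) -> bool:
    """Return the flag value, treating a missing section or key as False."""
    return bool(config.get("dl_config", {}).get("convert_log_to_df", False))


config = json.loads(config_text)
print(wants_dataframe(config))
print(wants_dataframe({"dl_config": {}}))
```

Because json.loads rejects trailing commas, running a configuration file through it is also a quick way to catch syntax slips before a study starts.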

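import matplotlib.pyplot as plt
import pandas as pd

# Load the dataframe generated for rank 0 from the study output directory.
df = pd.read_pickle("STUDY_OUT/tensorboard/data_rank_0.pkl")

# Keep only the rows logged under the "Loss/train" tag.
train_loss_df = df[df["name"] == "Loss/train"]

# Plot the training loss on a logarithmic y-axis.
plt.figure()
train_loss_df["value"].plot()
plt.xlabel("# Batch")
plt.yscale("log")
plt.ylabel("Loss")
plt.show()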