5 Tools to monitor GPU training

Author: Sjoerd de Haan
Published: May 16, 2025

Introduction

Training models takes experimentation, iteration — and watching your system like a hawk. If you’re not monitoring your GPUs and I/O closely, you’re probably wasting time (and compute).

There are always opportunities to improve model performance — but only if you catch them early in training. I run experiments across 5 GPUs on my local rig, constantly tweaking architectures and training setups. These tools help me spot bottlenecks before they waste hours.

5 system monitoring tools

Here are the 5 tools I use to monitor system performance while training models.

Htop

A classic terminal tool that shows CPU usage per core, memory usage, and swap. I use htop first when optimizing data loading and CPU-bound preprocessing. Pro tip: Filter by command or user with F4 to isolate your training jobs.

Figure 1: Htop showing CPU utilization per core and memory utilization on my rig.
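To isolate your own training jobs, you can start htop scoped to your user and then narrow down further with the interactive filter. A minimal sketch (the "python" filter string is just an example):

```bash
# Show only processes owned by the current user
htop -u "$USER"

# Inside htop: press F4 and type a filter such as "python"
# to keep only your training processes in view
```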

Glances

When the bottleneck isn’t the CPU, it’s usually I/O. glances shows disk, network, memory, and even Docker stats — all in one dashboard.

Pro tip: Look at the event log (bottom) for hidden issues like memory overflows or OOM kills.

Figure 2: The Glances event list reveals a memory overflow that must have triggered the termination of my training processes.
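Two invocations I find handy; a sketch, so check `glances --help` for the options available in your installed version:

```bash
# Refresh every 2 seconds instead of the default
glances -t 2

# Serve the dashboard over HTTP, handy for checking a headless training rig
glances -w
```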

Nvidia-smi

The go-to tool for checking real-time GPU usage, memory, and temperature. nvidia-smi is great for a quick status check. Pro tip: Run `watch -n 0.5 nvidia-smi` to refresh the output every half second.
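Beyond the interactive view, nvidia-smi can also print selected fields as CSV at a fixed interval, which is enough for lightweight logging. A sketch; the field names follow `nvidia-smi --help-query-gpu`:

```bash
# Log utilization, memory and temperature for all GPUs once per second
nvidia-smi \
  --query-gpu=timestamp,index,utilization.gpu,memory.used,temperature.gpu \
  --format=csv -l 1
```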

Nvtop

Nvtop shows live, continuously updating graphs of GPU usage over time. It reveals more about the usage patterns of a training process than nvidia-smi, which only displays instantaneous usage.

In addition, nvtop shows the names and usernames of all processes that are using a GPU.

Figure 3: Nvtop shows GPU usage over time.
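Getting started is usually a one-liner; a sketch assuming a Debian/Ubuntu machine where the package is in the standard repositories:

```bash
# Install and launch; nvtop picks up the GPUs it can see automatically
sudo apt install nvtop
nvtop
```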

More tools for system monitoring

Don't use TensorBoard or MLflow for system monitoring

I once tried logging system metrics like GPU and memory usage to TensorBoard and MLflow. I figured: “Why not store the full picture?” Turns out, that was a bad idea.

These dashboards are great for tracking training metrics, but terrible for system monitoring. The UI is clunky, not optimized for quick glances, and it’s a hassle to configure system stats. Add to that a 10+ second lag, heavy log files, and noticeable overhead — and it’s just not worth it.

Tmux for session management

Training multiple models at once feels like playing simultaneous chess. A terminal multiplexer like tmux helps you switch between REPLs, diagnostics, and training jobs without losing your flow. Even better: it keeps everything running when you disconnect — perfect for long training runs on remote machines.

Pro tip: Use tmuxp to save and restore full tmux sessions.
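A minimal workflow for a long run looks like this; the tmuxp part is a sketch and assumes the frozen config lands in ~/.tmuxp:

```bash
# Start a named session for a long training run
tmux new -s train

# Detach with Ctrl-b d; the job keeps running. Reattach later with:
tmux attach -t train

# Snapshot the session layout to a YAML config, and restore it another day
tmuxp freeze train
tmuxp load train
```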

Conclusion

Great training runs come from great setups. These tools help you catch bugs faster, track resource bottlenecks, and avoid silent failures.

Monitoring tools won’t make your model smarter — but they will make you smarter. If you train on your own hardware (or even on Colab/Kaggle), give these tools a try.

What’s your go-to tool for training-time debugging? Did I miss a favorite?