NVIDIA Nsight Systems GPU profiling

AI Performance Engineering Meetup

ML-Ops
meetup
GPU profiling
Author

Oren Bochman

Published

Monday, January 19, 2026

Keywords

artificial intelligence, autonomous agents, AI agents, technology trends, future of AI, intelligent agents, machine learning, AI applications, AI research, context platform engineering, GPU profiling, NVIDIA Nsight, PyTorch optimization, AI infrastructure, performance engineering

The AI Performance Engineering Meetup

Diving deep into NVIDIA Nsight Systems GPU profiling tools for PyTorch LLM and computer vision workloads

These talks cover two issues that are familiar to anyone who has taken a course in which one builds and trains deep learning models:

  1. Moving data to and from the GPU is much slower than processing it inside the GPU. It is so slow that if the number of matrix operations isn’t very large, it may be faster to run the model on the CPU than on the GPU! Anyone familiar with FlashAttention or the Mamba architecture will realize that this is not a trivial issue, since even within the GPU there are multiple levels of memory with different speeds and latencies. (A rough timing sketch follows after this list.)
  2. Early books on deep learning discussed how batching speeds up training and treated it as the main innovation behind stochastic gradient descent. It was recognized early on that large batches of dissimilar samples lead to faster convergence. On the other hand, it is a challenge to fit batches of differently sized samples into GPU memory. Padding is one solution, but it wastes memory and compute; dynamic batching is another, but it adds overhead for managing the batches. Inference workloads are particularly challenging here, since requests arrive at different times and with different sizes. (A padding sketch also follows after this list.)
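
To make the first issue concrete, here is a rough timing sketch of my own (not from the talks; the sizes are illustrative, and a real benchmark would warm up and average) comparing the host-to-device copy against the matmul it feeds:

import time
import torch

x = torch.randn(512, 512)
w = torch.randn(512, 512)

# CPU baseline: no transfer needed
t0 = time.perf_counter()
y_cpu = x @ w
print(f"CPU matmul: {(time.perf_counter() - t0) * 1e3:.2f} ms")

if torch.cuda.is_available():
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    xg, wg = x.to("cuda"), w.to("cuda")  # host-to-device copy
    torch.cuda.synchronize()
    print(f"GPU copy:   {(time.perf_counter() - t0) * 1e3:.2f} ms")

    t0 = time.perf_counter()
    y_gpu = xg @ wg                      # the actual compute
    torch.cuda.synchronize()
    print(f"GPU matmul: {(time.perf_counter() - t0) * 1e3:.2f} ms")

For a matrix this small, the copy (plus kernel-launch overhead) can easily dominate, which is exactly the kind of GPU idle time the profiler timeline exposes.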
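
For the second issue, a minimal sketch (hypothetical data, again not from the talks) of padding variable-length sequences in a DataLoader collate function shows where the wasted memory and compute come from:

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# hypothetical variable-length token sequences standing in for a real dataset
sequences = [torch.randint(0, 1000, (n,)) for n in (12, 87, 503, 34)]

def collate(batch):
    lengths = torch.tensor([len(s) for s in batch])
    # pad every sequence to the longest one in the batch;
    # the padded positions are pure waste of memory and FLOPs
    return pad_sequence(batch, batch_first=True, padding_value=0), lengths

loader = DataLoader(sequences, batch_size=4, collate_fn=collate)
padded, lengths = next(iter(loader))
print(padded.shape, lengths)  # torch.Size([4, 503]) -- 503 columns even for the 12-token sample

Sorting or bucketing samples by length (a simple form of dynamic batching) reduces this waste, at the cost of extra batch-management logic.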

These two earlier talks discuss how to use the PyTorch Profiler to deal with these two issues.

There are a couple of previous videos by Chaim Rand from the AI Performance Engineering Meetup:

Modern AI model training often relies on powerful and expensive machinery like GPUs. Performance optimization is a key tool in accelerating model convergence and reducing development costs. Fortunately, you do not need to be a CUDA expert in order to introduce meaningful improvements to your training efficiency. In this talk, we will motivate the inclusion of performance analysis and optimization in your AI model development and demonstrate a few simple tricks and techniques for accelerating your model. cf. slides


Since life is short, I’ve put the videos for the first two parts in the margin.

Chaim is a great content creator and speaker. However, with so many articles and videos it is hard to find them all; I’ve linked a few resources below. Also, some points are more fundamental and get repeated across different talks.

  • AI/ML developers must take responsibility for the runtime performance of their workloads
  • You don’t need to be a CUDA expert to see results.

Optimization Methodology

  • Objective: maximize throughput (samples per second)
  • Use performance profilers to measure resource utilization and identify bottlenecks
  • Integrate into model development lifecycle
    1. Adding reporting to the model’s code so that the profiler can report performance metrics as part of the model’s training and inference processes. This is often just a matter of adding one or two lines of code to the training and inference scripts (see the sketch after this list).
    2. Using the profiler’s visualization tools to analyze the performance data and identify bottlenecks. There are two common undercurrents in software engineering here:
      • “If you can’t measure it, you can’t improve it.”
      • “Avoid premature optimization.” Data scientists often end up waiting while the model trains and then only look at the final results. But the lesson here is that we should be measuring performance as part of the model development lifecycle, before running any big training runs, as this could save us time and money. The culprit is that most notebooks used to teach deep learning/ML/AI do not cover this side of the craft, focusing instead on a single metric (e.g. precision).
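
As a sketch of point 1, hooking the PyTorch Profiler into a training loop looks roughly like this (toy model and data of my own, not code from the talks):

import torch
from torch import nn
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

# A toy model and synthetic data stand in for a real workload.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(1024, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
data = [torch.randn(64, 1024) for _ in range(8)]

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(
    activities=activities,
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./logs/profile"),
) as prof:
    for batch in data:
        loss = model(batch.to(device)).sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        prof.step()  # tell the profiler a training step has finished

The resulting trace can then be opened in the profiler’s visualization tools (point 2) to find the gaps where the GPU sits idle.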

A Suggested recipe

  1. Profile: use a profiler (e.g. PyTorch Profiler or NVIDIA Nsight Systems) to identify bottlenecks.
  2. Optimize: fix the bottlenecks using techniques such as caching, tuning data loading, memory management, kernel fusion, mixed-precision training, or many of the other techniques discussed in the talks and articles (see the sketch after this list).
  3. Repeat until the GPU is well utilized (i.e. no red blocks or gaps appear in the profiler) and/or other performance goals are met (e.g. latency, throughput, cost).
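
As one sketch of step 2, two of the more common fixes, tuning the DataLoader and switching to mixed precision, look roughly like this (toy model and data of my own; the exact knobs depend on your workload, and this assumes a CUDA GPU):

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

device = "cuda"  # assumes a CUDA GPU is available
model = nn.Linear(1024, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
dataset = TensorDataset(torch.randn(4096, 1024), torch.randint(0, 10, (4096,)))

# Data-loading fixes: parallel workers, pinned memory, workers kept alive between epochs.
loader = DataLoader(dataset, batch_size=64, num_workers=4,
                    pin_memory=True, persistent_workers=True)

scaler = torch.cuda.amp.GradScaler()
for x, y in loader:
    x = x.to(device, non_blocking=True)  # overlap the copy with compute
    y = y.to(device, non_blocking=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()  # mixed-precision backward pass
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad(set_to_none=True)

Whether these particular knobs help is exactly what the profile from step 1 should tell you; they are fixes for data-starved GPUs, not universal wins.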

There are a number of steps in a typical AI/ML workload, and a principled approach to optimization is to profile each step and find the bottlenecks. In the previous talks on the PyTorch Profiler, Chaim laid out his optimization methodology.

Perhaps most useful is his YouTube channel, which hosts the previous talks in this series.

He says that developers and data scientists should take responsibility for checking that they are actually utilizing their GPUs (i.e. by profiling).


This talk shows screenshots of the NVIDIA Nsight Systems GPU profiler.

We start with images of a profile in bad shape: red blocks and idle gaps!

Applying one to three interventions removes the red from the profiler and yields a 2-4x speedup.

The profiling relies on NVTX range annotations.
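
A minimal sketch of what such NVTX annotations look like in PyTorch (toy loop; the range names are mine, not the speaker’s):

import torch
from torch import nn

model = nn.Linear(1024, 10).to("cuda")  # assumes a CUDA GPU
data = [torch.randn(64, 1024) for _ in range(8)]

for step, batch in enumerate(data):
    torch.cuda.nvtx.range_push(f"step_{step}/copy_to_gpu")
    batch = batch.to("cuda", non_blocking=True)
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push(f"step_{step}/forward_backward")
    loss = model(batch).sum()
    loss.backward()
    torch.cuda.nvtx.range_pop()

Capturing a run with something like nsys profile -t cuda,nvtx -o report python train.py then shows these named ranges as labelled blocks on the Nsight Systems timeline, which makes the idle gaps between them easy to spot.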

Resources

Reflection & Some questions

  • The long series covers the PyTorch Profiler together with TensorBoard for visualization. It looks like TensorBoard has been abandoned by Google, or at least deprecated in favor of other tools. Are there other tools that one should be using today?

Citation

BibTeX citation:
@online{bochman2026,
  author = {Bochman, Oren},
  title = {NVIDIA {Nsight} {Systems} {GPU} Profiling},
  date = {2026-01-19},
  url = {https://orenbochman.github.io/posts/2026/2026-01-20-NVIDIA-Nsight-Systems-GPU-profiling/},
  langid = {en}
}
For attribution, please cite this work as:
Bochman, Oren. 2026. “NVIDIA Nsight Systems GPU Profiling.” January 19, 2026. https://orenbochman.github.io/posts/2026/2026-01-20-NVIDIA-Nsight-Systems-GPU-profiling/.