The AI Performance Engineering Meetup
- This is a recap of an online meetup
- This meetup has been running for a while. There are many previous talks on the AI Performance Engineering Meetup YouTube channel.
- The meetup is organized by Chris Fregly and others.
- Chris Fregly
    - is the author of the book AI Performance Engineering,
    - runs a blog on Medium, and
    - maintains a GitHub repository with code examples from the book as well as code and slides from the talks, though on first inspection the repo only covers a few of the talks.
Diving deep into NVIDIA Nsight Systems GPU profiling tools for PyTorch LLM and computer vision workloads
In this talk, Chaim Rand revisits the NVIDIA Nsight profiling tools to augment the PyTorch Profiler.
This talk is based on Chaim’s recent article series “A Deep Dive Using NVIDIA Nsight™ Systems Profiler”.
The articles cover two issues familiar to anyone who has taken a course on building and training deep learning models:
- Moving data to and from the GPU is much slower than processing it within the GPU. This is so slow that if the number of matrix operations isn’t very large, it may be faster to run the model on the CPU than on the GPU! Anyone familiar with Flash Attention and the Mamba architecture will realize that this is not a trivial issue, as even within the GPU there are multiple levels of memory with different speeds and latencies (see the sketch after this list).
- Early books on deep learning discussed how batching can speed up training and that it is the main innovation behind Stochastic Gradient Descent. It was recognized early on that large batches of dissimilar samples lead to faster convergence. On the other hand, it is a challenge to fit batches of differently sized samples into GPU memory. Padding is one solution, but it wastes memory and compute. Dynamic batching is another, but it adds overhead for managing the batches. This is particularly challenging in inference workloads, where requests arrive at different times and with different sizes.
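To make the first issue concrete, here is a minimal sketch (mine, not from the talk) comparing host-to-device copies from pageable versus pinned host memory; the tensor size and the `benchmark_copy` helper are hypothetical.

```python
import time

import torch

def benchmark_copy(pin_memory: bool, n_iters: int = 100) -> float:
    """Average time of a host-to-device copy of a ~64 MB tensor (hypothetical size)."""
    x = torch.randn(16 * 1024 * 1024, pin_memory=pin_memory)  # 16M float32 values
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iters):
        # non_blocking copies can only overlap with compute when the source is pinned
        _ = x.to("cuda", non_blocking=True)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_iters

if torch.cuda.is_available():
    print(f"pageable host memory: {benchmark_copy(False) * 1e3:.2f} ms per copy")
    print(f"pinned host memory:   {benchmark_copy(True) * 1e3:.2f} ms per copy")
```

On most machines the pinned copy is noticeably faster, and pinning is also what allows `non_blocking=True` transfers to overlap with GPU compute.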
There are a couple of previous videos by Chaim Rand on the AI Performance Engineering Meetup channel; these two earlier talks discuss how to use the PyTorch Profiler to deal with these two issues.
Modern AI model training often relies on powerful and expensive machinery like GPUs. Performance optimization is a key tool in accelerating model convergence and reducing development costs. Fortunately, you do not need to be a CUDA expert in order to introduce meaningful improvements to your training efficiency. In this talk, we will motivate the inclusion of performance analysis and optimization in your AI model development and demonstrate a few simple tricks and techniques for accelerating your model. c.f. slides
c.f. slides
Since life is short I’ve put the videos for the first two parts in the margin.
Chaim is a great content creator and speaker. However, with so many articles and videos it is hard to find them all, so I’ve linked a few resources below. Also, some points are more fundamental and get repeated in different talks.
- AI/ML developers must take responsibility for the runtime performance of their workloads
- You don’t need to be a CUDA expert to see results.
Optimization Methodology
- Objective: maximize throughput (samples per second)
- Use performance profilers to measure resource utilization and identify bottlenecks
- Integrate into model development lifecycle
- Adding instrumentation to the model’s code so that the profilers can report performance metrics as part of the model’s training and inference processes. This is often just a matter of adding one or two lines of code to the training and inference scripts (see the sketch after this list).
- Using the profiler’s visualization tools to analyze the performance data and identify bottlenecks. There are two common undercurrents in software engineering here:
- “if you can’t measure it you can’t improve it”.
- “avoid premature optimization”. Data scientists often end up waiting while the model trains and then only look at the final results. But the lesson here is that perhaps we should be measuring performance as part of the model development lifecycle, before running any big training runs, as this could save us time and money. The culprit is that most notebooks used to teach deep learning/ML/AI do not cover this side of the craft, focusing on a single metric (e.g. precision).
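To illustrate the “one or two lines of code” point above, here is a minimal sketch of wrapping a training loop with the PyTorch Profiler and writing traces for TensorBoard; the names `model`, `loss_fn`, `optimizer`, and `loader` are hypothetical placeholders.

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

# model, loss_fn, optimizer, loader are assumed to be defined elsewhere (hypothetical names)
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./logs/profiler"),
    record_shapes=True,
    profile_memory=True,
) as prof:
    for step, (inputs, targets) in enumerate(loader):
        if step >= 5:  # profile only the first few steps
            break
        inputs, targets = inputs.to("cuda"), targets.to("cuda")
        loss = loss_fn(model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        prof.step()  # advance the profiler schedule once per training step
```

The resulting traces can then be inspected in TensorBoard or exported and viewed elsewhere.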
A Suggested recipe
- Profile (use a profiler e.g. PyTorch Profiler or NVIDIA Nsight to identify bottlenecks)
- Optimize, i.e. fix bottlenecks using optimization techniques such as caching, data loading improvements, memory management, kernel fusion, mixed precision training, or many of the other techniques discussed in the talks and articles (mixed precision is sketched after this list).
- Repeat until the GPU is well utilized (i.e. no red blocks or gaps appear in the profiler) and/or other performance goals are met (e.g. latency, throughput, cost).
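As an example of one fix from the recipe, here is a minimal sketch of mixed precision training with `torch.cuda.amp`; again, `model`, `loss_fn`, `optimizer`, and `loader` are hypothetical placeholders, and whether it helps depends on the model and the hardware.

```python
import torch

# model, loss_fn, optimizer, loader are assumed to be defined elsewhere (hypothetical names)
scaler = torch.cuda.amp.GradScaler()

for inputs, targets in loader:
    inputs, targets = inputs.to("cuda"), targets.to("cuda")
    optimizer.zero_grad(set_to_none=True)
    # run the forward pass in reduced precision where it is safe to do so
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(inputs), targets)
    # scale the loss to avoid gradient underflow in float16
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```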
There are a number of steps in a typical AI/ML workload, and a principled approach to optimization is to profile each step and find the bottlenecks. In his previous talks on the PyTorch Profiler, Chaim laid out this optimization methodology.
Perhaps most useful is his YouTube channel, with previous talks in this series.
He says developers and data scientists should take responsibility for verifying that they are actually utilizing their GPUs (i.e. by profiling).
This talk shows screenshots of the NVIDIA Nsight Systems GPU profiler.
We start with images of a profile in bad shape: red blocks and idle gaps!
One to three interventions remove the red from the profiler and yield a 2-4x speedup.
The profiling relies on NVTX range annotations, sketched below.
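Here is a minimal sketch of adding NVTX ranges from PyTorch so that the phases of a training step show up as named blocks on the Nsight Systems timeline; the phase names are my own and the training objects (`model`, `loss_fn`, `optimizer`, `loader`) are hypothetical.

```python
import torch

# model, loss_fn, optimizer, loader are assumed to be defined elsewhere (hypothetical names)
for step, (inputs, targets) in enumerate(loader):
    torch.cuda.nvtx.range_push(f"step_{step}")

    torch.cuda.nvtx.range_push("copy_to_device")
    inputs = inputs.to("cuda", non_blocking=True)
    targets = targets.to("cuda", non_blocking=True)
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("forward")
    loss = loss_fn(model(inputs), targets)
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("backward")
    loss.backward()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("optimizer_step")
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_pop()  # close the step range
```

The ranges only appear on the timeline when the script is captured by Nsight Systems, e.g. with something like `nsys profile python train.py`.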
Resources
- Optimizing Data Transfer in AI/ML Workloads: A Deep Dive Using NVIDIA Nsight™ Systems Profiler — Part 1
- Optimizing Data Transfer in Batched AI/ML Inference Workloads: A Deep Dive Using NVIDIA Nsight™ Systems Profiler — Part 2
Series on PyTorch Model Performance Analysis and Optimization
- How to Use PyTorch Profiler and TensorBoard to Accelerate Training and Reduce Cost
- How to Identify and Reduce CPU Computation In Your Training Step with PyTorch Profiler and TensorBoard
- How to reduce “Cuda Memcpy Async” events and why you should beware of boolean mask operations
- Solving Bottlenecks on the Data Input Pipeline with PyTorch Profiler and TensorBoard
- How to Optimize Your DL Data-Input Pipeline with a Custom PyTorch Operator
- PyTorch Model Performance Analysis and Optimization
- Efficient Metric Collection in PyTorch: Avoiding the Performance Pitfalls of TorchMetrics
- A Caching Strategy for Identifying Bottlenecks on the Data Input Pipeline
- Pipelining AI/ML Training Workloads With CUDA Streams
- The Role of NUMA Awareness
- Overcoming the Hidden Performance Traps of Variable-Shaped Tensors: Efficient Data Sampling in PyTorch
Reflection & Some questions
- The long series covers the PyTorch Profiler plus TensorBoard for visualization. It looks like TensorBoard has been abandoned by Google, or at least deprecated in favor of other tools. Are there other tools one should be using today?
- Weights and Biases as an alternative.
- Perfetto - an open-source project for performance instrumentation and tracing of Android, Linux, Chrome, and other platforms.
- Chrome via chrome://tracing, cf. “How to use the Chrome UI to analyze PyTorch Profiler traces?” and the Chrome Tracing UI (sketch below).
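For the chrome://tracing / Perfetto route, a minimal sketch of my assumed workflow (not from the talk): export the PyTorch Profiler data as a Chrome trace and open the resulting JSON file in chrome://tracing or ui.perfetto.dev. The matrix-multiply loop is just a stand-in workload.

```python
import torch
from torch.profiler import ProfilerActivity, profile

# a tiny stand-in workload; replace with your real training or inference code
device = "cuda" if torch.cuda.is_available() else "cpu"
activities = [ProfilerActivity.CPU] + ([ProfilerActivity.CUDA] if device == "cuda" else [])
x = torch.randn(1024, 1024, device=device)

with profile(activities=activities) as prof:
    for _ in range(10):
        x = x @ x

# writes a JSON trace that chrome://tracing and ui.perfetto.dev can open
prof.export_chrome_trace("trace.json")
```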
Citation
@online{bochman2026,
author = {Bochman, Oren},
title = {NVIDIA {Nsight} {Systems} {GPU} Profiling},
date = {2026-01-19},
url = {https://orenbochman.github.io/posts/2026/2026-01-20-NVIDIA-Nsight-Systems-GPU-profiling/},
langid = {en}
}