Profiling CUDA kernels with Nsight Compute

#profiling #cuda #gpu-computing

Summary

With GPU computing now ubiquitous, understanding kernel behaviour is essential for maximizing performance. NVIDIA provides a set of tools that make it straightforward to profile and understand your CUDA kernel code. Among these, the most important one for single-kernel performance is Nsight Compute. It has some really nice features, including roofline analysis, memory and compute workload analysis, and more, which are easy to enable and reasonably easy to interpret. You can have a look at the different sections and metrics that can be enabled in the Nsight Compute documentation.

The SpMV kernel (sparse matrix-vector multiplication, y = A*x for a sparse matrix A) is one of the most important kernels in sparse linear algebra. Most linear and non-linear solvers and preconditioners spend the bulk of their time in this kernel, and hence it is important to optimize and tune it.
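To make the discussion concrete, here is a minimal sketch of what a row-parallel SpMV kernel looks like for a matrix in CSR format, with one thread assigned to each row. The kernel name and the array names (`row_ptrs`, `col_idxs`, `vals`) are illustrative assumptions, not the repository's actual API:

```cuda
// Row-parallel CSR SpMV sketch: one thread computes one row of y = A*x.
// All names here are hypothetical; the repository's kernels may differ.
__global__ void spmv_row_parallel(int num_rows,
                                  const int *__restrict__ row_ptrs,
                                  const int *__restrict__ col_idxs,
                                  const double *__restrict__ vals,
                                  const double *__restrict__ x,
                                  double *__restrict__ y)
{
    const int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        double sum = 0.0;
        // Accumulate the non-zeros of this row against the input vector.
        for (int k = row_ptrs[row]; k < row_ptrs[row + 1]; ++k) {
            sum += vals[k] * x[col_idxs[k]];
        }
        y[row] = sum;
    }
}
```

This strategy is simple and works well for matrices with balanced row lengths, but a few very long rows can leave most threads idle, which is exactly the kind of imbalance the Warp State and Scheduler sections of Nsight Compute help diagnose.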

The repository below provides a few variants of the SpMV kernel:

  1. Row-parallel.
  2. Block-parallel.
  3. cuSPARSE algorithm.
  4. Ginkgo classical algorithm.
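As a contrast to the one-thread-per-row strategy, a common way to parallelize within a row is to assign a full warp per row and reduce the partial sums with warp shuffles. The sketch below shows this "vector" CSR flavour; it is a generic illustration under assumed names, and the repository's block-parallel variant may be organized differently:

```cuda
// Warp-per-row CSR SpMV sketch: 32 lanes cooperate on each row of y = A*x.
// Hypothetical names; shown only to illustrate the parallelization strategy.
__global__ void spmv_warp_per_row(int num_rows,
                                  const int *__restrict__ row_ptrs,
                                  const int *__restrict__ col_idxs,
                                  const double *__restrict__ vals,
                                  const double *__restrict__ x,
                                  double *__restrict__ y)
{
    const int lane = threadIdx.x % 32;
    const int row = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    if (row < num_rows) {
        double sum = 0.0;
        // Each lane handles a strided subset of the row's non-zeros.
        for (int k = row_ptrs[row] + lane; k < row_ptrs[row + 1]; k += 32) {
            sum += vals[k] * x[col_idxs[k]];
        }
        // Warp-level tree reduction of the partial sums.
        for (int offset = 16; offset > 0; offset /= 2) {
            sum += __shfl_down_sync(0xffffffff, sum, offset);
        }
        if (lane == 0) {
            y[row] = sum;
        }
    }
}
```

The trade-off between the two strategies (coalesced loads and load balance for long rows versus wasted lanes on short rows) is precisely what the Memory Workload Analysis and Occupancy sections surface in the profile.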

You can pass in any matrix in the Matrix Market format (for example, from the SuiteSparse Matrix Collection) and run the benchmark code.

An example run could look like:

> PROFILE_OPTIONS="--section SpeedOfLight --section Occupancy --section WarpStateStats --section ComputeWorkloadAnalysis --section MemoryWorkloadAnalysis --section SchedulerStats --section SourceCounters --section SpeedOfLight_RooflineChart"

> ncu ${PROFILE_OPTIONS} -o spmv-profiling -f /path/to/run_spmv --matrix="path/to/mtx" --strategy="block_parallel"

This writes an Nsight Compute report (spmv-profiling.ncu-rep), which you can load into the Nsight Compute UI (ncu-ui) to analyze your code:

> ncu-ui spmv-profiling.ncu-rep

Repository and resources

A hands-on repository with the kernels and the profiled code, profiling-hands-on, is available on GitLab.

Slides

More detailed information on execution and options is available in the [slides].