Mixed precision algorithms

#mixed-precision #multi-precision #accessors #ginkgo

Summary

The plateauing of the number of transistors on CPUs has signaled the end of Moore's law and necessitated the development of parallel architectures. In recent years, the performance of memory units has also lagged behind the performance of computing units. While these factors have not yet had an enormous effect on dense linear algebra, where the majority of operations are compute bound, their effect on sparse linear algebra has been clear due to the inherent involvement of memory bound operations.

With memory movement becoming more expensive, a lot can be gained by reducing and optimizing memory traffic, in particular on many-core computing units such as general purpose GPUs. In that direction, mixed precision algorithms have attracted particular interest, not only because of their capability to reduce the pressure on memory accesses, but also because special function units, such as the Tensor Cores on NVIDIA GPUs, can perform low precision computations at breakneck speeds.

In recent years, there has been a lot of interest in the development of mixed and multi-precision algorithms that can leverage these special function units while providing the same quality of results. An overview of recent algorithms and implementations can be found in the survey. Another excellent reference is Nick Higham's book, which focuses on the theoretical development of mixed and multi-precision algorithms.

Mixed precision algorithms in Ginkgo

Within Ginkgo, mixed precision algorithms are of particular interest. The modular and templated nature of Ginkgo allows for easy development and testing of mixed precision algorithms. In particular, Ginkgo currently provides the following mixed precision capabilities:

  1. Mixed precision block-Jacobi: The block-Jacobi preconditioner is one of the simplest and cheapest preconditioners available. Ginkgo’s block-Jacobi preconditioner gives the user the option of storing the different blocks in different precisions based on their condition numbers, thereby saving memory, and it has been shown to perform well in a wide array of situations (see the first sketch after this list). [Reference]

  2. Mixed precision ISAI: Incomplete sparse approximate inverse (ISAI) preconditioners are general preconditioners that are useful for both symmetric and non-symmetric linear systems. In Ginkgo, users have the option of using a mixed precision version of the ISAI preconditioner.

  3. Compressed basis GMRES: The GMRES algorithm is well known for its generality and its residual minimization property. At the heart of the algorithm lies the orthogonalization, which in many cases forms the bottleneck due to its synchronization requirements and memory-bound nature. The two most popular ways to orthogonalize the bases are the Classical Gram-Schmidt (CGS) method and the more stable Modified Gram-Schmidt (MGS) method, with re-orthogonalization if necessary. With Ginkgo’s CB-GMRES solver, the Krylov bases are stored in lower precision (called the memory/storage precision), but the computations are performed in higher, full precision (called the arithmetic precision). This minimizes the effects of round-off errors and yields a well-performing GMRES solver (see the second sketch after this list).
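
As a first sketch, the adaptive precision block-Jacobi preconditioner can be configured through Ginkgo's factory parameters. This is a minimal example based on Ginkgo's C++ API as I understand it (the file name "A.mtx" is a placeholder); `gko::precision_reduction::autodetect()` asks Ginkgo to choose each block's storage precision from its condition number:

```cpp
#include <fstream>

#include <ginkgo/ginkgo.hpp>

int main()
{
    // Executor on which the preconditioner is generated and applied
    auto exec = gko::ReferenceExecutor::create();

    // "A.mtx" is a placeholder for the system matrix file
    auto A = gko::share(gko::read<gko::matrix::Csr<double, int>>(
        std::ifstream("A.mtx"), exec));

    // Adaptive precision block-Jacobi: each block is stored in the
    // precision its condition number allows, saving memory bandwidth
    auto jacobi = gko::preconditioner::Jacobi<double, int>::build()
                      .with_max_block_size(16u)
                      .with_storage_optimization(
                          gko::precision_reduction::autodetect())
                      .on(exec)
                      ->generate(A);
}
```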
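
As a second sketch, CB-GMRES can be set up with a reduced storage precision for the Krylov basis (again assuming Ginkgo's C++ API; `storage_precision::reduce1` stores the double-precision basis with one precision reduction, i.e. in single precision, while the arithmetic stays in double):

```cpp
#include <fstream>

#include <ginkgo/ginkgo.hpp>

int main()
{
    auto exec = gko::ReferenceExecutor::create();
    auto A = gko::share(gko::read<gko::matrix::Csr<double, int>>(
        std::ifstream("A.mtx"), exec));  // "A.mtx" is a placeholder

    // CB-GMRES: arithmetic precision is double, but the Krylov basis
    // is stored with one precision reduction (double -> float)
    auto solver =
        gko::solver::CbGmres<double>::build()
            .with_krylov_dim(100u)
            .with_storage_precision(
                gko::solver::cb_gmres::storage_precision::reduce1)
            .with_criteria(
                gko::stop::Iteration::build().with_max_iters(1000u).on(exec),
                gko::stop::ResidualNorm<double>::build()
                    .with_reduction_factor(1e-8)
                    .on(exec))
            .on(exec)
            ->generate(A);

    // auto b = ..., x = ...;  // right-hand side and solution vectors
    // solver->apply(b, x);    // omitted in this sketch
}
```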

To allow for generic implementations of mixed precision algorithms and to leverage this concept of separating the memory/storage precision from the arithmetic/compute precision, Ginkgo implements the accessor concept. The accessor behaves as a filter/transformer that can transparently convert, transform, scale, or compress the underlying data. It has been designed so that the overhead of these operations is minimal.
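
The following is a minimal, self-contained sketch of the idea behind the accessor, not Ginkgo's actual accessor interface: values are stored in a low storage precision (float), while every read and write passes through a higher arithmetic precision (double), so that accumulations do not lose accuracy:

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Illustrative sketch only (not Ginkgo's accessor API): data is *stored*
// in StorageType, but every access is *converted* to/from ArithmeticType,
// separating the memory precision from the compute precision.
template <typename ArithmeticType, typename StorageType>
class reduced_accessor {
public:
    explicit reduced_accessor(StorageType* data) : data_{data} {}

    // Reading converts storage precision -> arithmetic precision
    ArithmeticType operator()(std::size_t i) const
    {
        return static_cast<ArithmeticType>(data_[i]);
    }

    // Writing converts arithmetic precision -> storage precision
    void write(std::size_t i, ArithmeticType value)
    {
        data_[i] = static_cast<StorageType>(value);
    }

private:
    StorageType* data_;
};

int main()
{
    std::vector<float> storage(2, 0.0f);  // memory traffic happens in float
    reduced_accessor<double, float> acc(storage.data());

    acc.write(0, 1.0);
    acc.write(1, 2.0);
    // The accumulation below happens entirely in double precision
    double dot = acc(0) * acc(0) + acc(1) * acc(1);
    std::cout << dot << '\n';  // prints 5
}
```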