2016.42: A Comparison of Potential Interfaces for Batched BLAS Computations
2016.42: Samuel D. Relton, Pedro Valero-Lara and Mawussi Zounon (2016) A Comparison of Potential Interfaces for Batched BLAS Computations.
One trend in modern high performance computing (HPC) is to decompose a large linear algebra problem into thousands of small problems which can be solved independently. There is a clear need for a batched BLAS standard, allowing users to perform thousands of small BLAS operations in parallel and making efficient use of their hardware. There are many possible ways in which the BLAS standard can be extended for batch operations. We discuss many of these possible designs, giving benefits and criticisms of each, along with a number of experiments designed to determine how the API may affect performance on modern HPC systems. Related issues that influence API design, such as the effect of memory layout on performance, are also discussed.
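To make the idea of a batched operation concrete, the following is a minimal sketch of what one possible batched matrix-multiply call might look like. It is not an interface taken from the preprint: the names `gemm` and `gemm_batch`, the argument order, and the list-of-rows storage are all assumptions chosen for illustration, and the loop runs serially where a real implementation would distribute the independent problems across cores or a GPU.

```python
def gemm(alpha, a, b, beta, c):
    """Compute C <- alpha * A * B + beta * C in place for one small matrix.

    Matrices are stored as lists of rows (an assumption for this sketch).
    """
    m, k, n = len(a), len(b), len(b[0])
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            c[i][j] = alpha * acc + beta * c[i][j]


def gemm_batch(alpha, a_batch, b_batch, beta, c_batch):
    """Hypothetical batched call: one independent GEMM per batch entry.

    Every problem shares alpha and beta here; other designs could pass
    per-problem scalars or sizes instead.
    """
    if not (len(a_batch) == len(b_batch) == len(c_batch)):
        raise ValueError("all batches must contain the same number of matrices")
    # Each iteration is independent, so the loop could run in parallel.
    for a, b, c in zip(a_batch, b_batch, c_batch):
        gemm(alpha, a, b, beta, c)


# Three independent 2x2 problems, each multiplying A by the identity.
a_batch = [[[1.0, 2.0], [3.0, 4.0]] for _ in range(3)]
b_batch = [[[1.0, 0.0], [0.0, 1.0]] for _ in range(3)]
c_batch = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(3)]
gemm_batch(1.0, a_batch, b_batch, 0.0, c_batch)
```

In this sketch each matrix is a separate object, which loosely corresponds to passing an array of pointers. Packing the whole batch into one contiguous buffer is a different layout choice, and is the kind of memory-layout question the abstract refers to.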
|Item Type:||MIMS Preprint|
|Uncontrolled Keywords:||BLAS, batched BLAS, linear algebra, parallel computing, high-performance computing|
|Subjects:||MSC 2000 > 68 Computer science|
|Deposited By:||Dr Samuel Relton|
|Deposited On:||04 August 2016|