What is it about?
Tensor transposition is a fundamental operation in tensor calculations with many applications. However, a naive implementation that copies each element from the source tensor to its transposed position in a target tensor requires twice the memory, making it unsuitable for large tensors on memory-limited accelerators such as Graphics Processing Units (GPUs). In this paper, we propose an algorithm and its implementation, called EITHOT, for Efficient In-place Transposition of High Order Tensors on GPUs, which requires at most 5% additional memory for large high-order tensors. To achieve this, EITHOT uses a newly proposed method, called permutation decomposition, to factorize the transposition of a high-order tensor into a sequence of low-order tensor transpositions. Then, based on the estimated extra memory requirement, EITHOT divides a large tensor into smaller tensors and transposes each one separately. Finally, the transposed smaller tensors are combined to form the desired result. The GPU implementation optimizes memory-access performance using the cooperative-groups programming model. Our experiments demonstrate that EITHOT delivers performance competitive with state-of-the-art out-of-place GPU implementations. Furthermore, EITHOT can handle tensors nearly twice as large as out-of-place methods can, making it suitable for arbitrary transpositions of N-order tensors.
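The core idea behind permutation decomposition can be illustrated with a small sketch. The snippet below is not the paper's actual algorithm; it only shows, in NumPy, how an arbitrary axis permutation of an N-order tensor can be factored into a sequence of pairwise axis swaps, each of which is a lower-order transposition of the kind that in-place methods can then perform with little extra memory.

```python
import numpy as np

def decompose_permutation(perm):
    """Factor an axis permutation into pairwise swaps (illustrative only).

    The swaps are recorded while sorting `perm` down to the identity, so
    applying them in REVERSE order rebuilds `perm` from the identity.
    """
    perm = list(perm)
    swaps = []
    for i in range(len(perm)):
        while perm[i] != i:
            j = perm[i]
            swaps.append((i, j))
            perm[i], perm[j] = perm[j], perm[i]
    return swaps

def transpose_via_swaps(tensor, perm):
    """Realize a full N-order transposition as a chain of two-axis swaps."""
    for i, j in reversed(decompose_permutation(perm)):
        # Each swap is a low-order transposition of the running result.
        tensor = np.swapaxes(tensor, i, j)
    return tensor

# Example: permute a 4th-order tensor with the axis order (2, 0, 3, 1).
x = np.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5)
perm = (2, 0, 3, 1)
assert np.array_equal(transpose_via_swaps(x, perm), np.transpose(x, perm))
```

Note that np.swapaxes only returns a view, so this sketch demonstrates the decomposition idea, not the in-place memory behavior; in the paper, each factor is carried out as a genuine in-place low-order transposition on the GPU, and a large tensor is additionally split into chunks that are transposed one at a time.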
Read the Original
This page is a summary of: EITHOT: Efficient In-place Transposition of High Order Tensors on GPUs, ACM Transactions on Parallel Computing, January 2025, ACM (Association for Computing Machinery),
DOI: 10.1145/3711871.