Tensor Learning

High Performance Tensor Computing for Machine Learning

We developed efficient tensor libraries for tensor decompositions and tensor networks, including CP, Tucker, hierarchical Tucker, tensor-train, tensor-ring, and low-tubal-rank tensor decompositions. On GPU tensor cores, we provide efficient primitives for the tensor, Hadamard, and Khatri-Rao products, as well as contraction, matricization, tensor times matrix (TTM), and matricized tensor times Khatri-Rao product (MTTKRP). These operations are the key components of tensor algebra.
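To make the primitives named above concrete, here is a minimal CPU reference sketch in NumPy of the Khatri-Rao product, matricization, TTM, and MTTKRP for a third-order tensor. It is not the library's GPU implementation or API: the function names (`khatri_rao`, `unfold`, `ttm`, `mttkrp_mode0`) are hypothetical, and the unfolding uses NumPy's row-major ordering, which is an assumption that fixes the order of the Khatri-Rao factors.

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product: row j*K + k holds B[j, :] * C[k, :]."""
    J, R = B.shape
    K, R2 = C.shape
    assert R == R2, "both factors need the same number of columns"
    return (B[:, None, :] * C[None, :, :]).reshape(J * K, R)

def unfold(X, mode):
    """Mode-`mode` matricization using NumPy's row-major ordering (assumed convention)."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def ttm(X, U, mode):
    """Tensor times matrix: contract the columns of U with axis `mode` of X."""
    return np.moveaxis(np.tensordot(U, X, axes=(1, mode)), 0, mode)

def mttkrp_mode0(X, B, C):
    """MTTKRP for mode 0: unfolded tensor times the Khatri-Rao product of B and C."""
    return unfold(X, 0) @ khatri_rao(B, C)

# Small random example (sizes are arbitrary, for illustration only).
rng = np.random.default_rng(0)
I, J, K, R = 4, 5, 6, 3
X = rng.standard_normal((I, J, K))
B = rng.standard_normal((J, R))
C = rng.standard_normal((K, R))
U = rng.standard_normal((7, J))

M = mttkrp_mode0(X, B, C)   # shape (I, R)
Y = ttm(X, U, 1)            # shape (I, 7, K)

# Cross-check both results against direct index-notation contractions.
M_ref = np.einsum('ijk,jr,kr->ir', X, B, C)
Y_ref = np.einsum('ijk,pj->ipk', X, U)
print(np.allclose(M, M_ref), np.allclose(Y, Y_ref))
```

The `einsum` cross-checks state the same contractions in index notation, which makes the sketch easy to compare against other unfolding conventions.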

For example, the cuTensor-tubal library adopts a frequency-domain computation scheme. We optimize data transfer and memory access, and support seven key tensor operations: t-FFT, inverse t-FFT, t-product, t-SVD, t-QR, t-inverse, and t-normalization. The library fully exploits the separability in the frequency domain and maps the tube-wise and slice-wise parallelism onto the single instruction, multiple thread (SIMT) GPU architecture.
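The frequency-domain scheme can be illustrated with the t-product. The NumPy sketch below is a CPU reference under stated assumptions, not the cuTensor-tubal API: the function names are hypothetical, and the t-product is taken to be the usual tube-wise circular convolution. It applies an FFT along each tube, multiplies the frontal slices independently, and transforms back; the independent slice multiplications are the separability that a GPU can run in parallel.

```python
import numpy as np

def t_product(A, B):
    """t-product of A (n1 x n2 x n3) and B (n2 x n4 x n3) via the frequency domain."""
    A_hat = np.fft.fft(A, axis=2)   # tube-wise FFT along the third dimension
    B_hat = np.fft.fft(B, axis=2)
    # Each frontal slice k is an independent matrix product (slice-wise parallelism).
    C_hat = np.einsum('ijk,jlk->ilk', A_hat, B_hat)
    # For real inputs the result is real up to rounding, so drop the imaginary part.
    return np.fft.ifft(C_hat, axis=2).real

def t_product_direct(A, B):
    """Reference: circular convolution of tubes, computed directly in the time domain."""
    n1, n2, n3 = A.shape
    n4 = B.shape[1]
    C = np.zeros((n1, n4, n3))
    for k in range(n3):
        for s in range(n3):
            C[:, :, k] += A[:, :, s] @ B[:, :, (k - s) % n3]
    return C

# Small random example (sizes are arbitrary, for illustration only).
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4, 5))
B = rng.standard_normal((4, 2, 5))

C_fft = t_product(A, B)
C_ref = t_product_direct(A, B)
print(C_fft.shape, np.allclose(C_fft, C_ref))
```

The direct version needs n3 squared slice products, while the frequency-domain version needs n3 slice products plus the FFTs, which is one reason the transform-based scheme is attractive.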

COMPUTING THIRD-ORDER TENSORS ON GPUS

We briefly summarize the concept of the low-tubal-rank tensor model together with its basic and key operations, and introduce how to compute the third-order tensor operations of this model on GPUs. Throughout this study, we focus on real-valued third-order tensors in the space R

DESIGN OF THE CUTENSOR-TUBAL LIBRARY

We design this library on top of existing highly optimized CUDA libraries including cuFFT [27], cuBLAS [27], cuSOLVER [27], MAGMA [28], and KBLAS [29] for efficient vector Fourier transform and matrix computations.

OVERVIEW OF THE LIBRARY

The cuTensor-tubal library consists of five layers: Applications; cuTensor-tubal API; Third-party tensor libraries; CUDA libraries; Hardware platform.

PERFORMANCE EVALUATION

We measure the running time and speedups of the seven key tensor operations. Beyond the individual operations, we further test the performance of tensor completion and t-SVD-based video compression.
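As an illustration of the idea behind t-SVD-based compression, the following NumPy sketch truncates a third-order tensor to a given tubal rank by taking a truncated SVD of every frontal slice in the frequency domain. It is a hedged CPU sketch, not the evaluated GPU code: the function name `t_svd_truncate`, the synthetic test tensor, and the sizes are all assumptions made for the example.

```python
import numpy as np

def t_svd_truncate(A, r):
    """Approximate A (n1 x n2 x n3) with tubal rank at most r via slice-wise SVDs."""
    n3 = A.shape[2]
    A_hat = np.fft.fft(A, axis=2)            # tube-wise FFT
    out_hat = np.empty_like(A_hat)
    for k in range(n3):
        U, s, Vh = np.linalg.svd(A_hat[:, :, k], full_matrices=False)
        # Keep only the r largest singular triplets of this frontal slice.
        out_hat[:, :, k] = (U[:, :r] * s[:r]) @ Vh[:r, :]
    # For real input the result is real up to rounding, so drop the imaginary part.
    return np.fft.ifft(out_hat, axis=2).real

# Synthetic tensor with tubal rank 2: a t-product of two thin random tensors,
# formed slice by slice in the frequency domain.
rng = np.random.default_rng(0)
n1, n2, n3, rank = 8, 6, 5, 2
L_hat = np.fft.fft(rng.standard_normal((n1, rank, n3)), axis=2)
R_hat = np.fft.fft(rng.standard_normal((rank, n2, n3)), axis=2)
A = np.fft.ifft(np.einsum('irk,rjk->ijk', L_hat, R_hat), axis=2).real

def rel_err(A_r):
    """Relative Frobenius-norm error of an approximation to A."""
    return np.linalg.norm(A_r - A) / np.linalg.norm(A)

err2 = rel_err(t_svd_truncate(A, 2))   # truncation at the true tubal rank
err1 = rel_err(t_svd_truncate(A, 1))   # more aggressive truncation
print(err2 < 1e-10, err1 > err2)
```

Truncating at the true tubal rank recovers the tensor up to rounding, while a smaller rank trades accuracy for storage: the truncated factors need roughly r(n1 + n2 + 1)n3 values instead of n1 n2 n3.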

RELATED WORKS

Early works accelerate tensor computations primarily on single machines with multi-core CPUs or on distributed CPU clusters. Later, with the advent of high-performance GPUs, more and more works adopt GPUs to accelerate computationally intensive tensor computations.

CONCLUSION AND FUTURE WORK

We presented cuTensor-tubal, a library of common tensor operations for low-tubal-rank tensor decomposition. In the future, we plan to extend the cuTensor-tubal library to include more tensor operations, and to scale the library onto multi-GPU systems.

TensorLet Team

What cuTensor has achieved so far!

For tensor decompositions, our cuTensor library achieves speedups xxx.

Related Publications

[Book Chapter] X.-Y. Liu, Y. Fang, L. Yang, Z. Li, A. Walid. High-performance Tensor Decompositions for Compressing and Accelerating Deep Neural Networks. Tensors for Data Processing: Theory, Methods, and Applications. Elsevier; 2021 Nov 10.

[ICDCS] X.-Y. Liu, J. Zhang, G. Wang, W. Tong, and A. Walid. Efficient Pretraining and Finetuning of Quantized LLMs with Low-rank Structure. IEEE ICDCS, 2024.

[TC] X.-Y. Liu, H. Hong, Z. Zhang, W. Tong, J. Kossaifi, X. Wang, and A. Walid. High-performance Tensor-Train Primitives Using GPU Tensor Cores. IEEE Transactions on Computers, 2024.

[TC] X.-Y. Liu, Z. Zhang, Z. Wang, H. Lu, X. Wang*, and A. Walid. High-performance tensor learning primitives using GPU tensor cores. IEEE Transactions on Computers, 2022.

[TC] H. Huang, X.-Y. Liu*, W. Tong, T. Zhang, A. Walid, and X. Wang. High performance hierarchical Tucker tensor learning using GPU tensor cores. IEEE Transactions on Computers, 2022.
[TPDS] T. Zhang, X.-Y. Liu*, X. Wang. High performance GPU tensor completion with tubal-sampling pattern. IEEE Transactions on Parallel and Distributed Systems, 2020.
[TPDS] T. Zhang, X.-Y. Liu*, X. Wang, A. Walid. cuTensor-tubal: Efficient primitives for tubal-rank tensor learning operations on GPUs. IEEE Transactions on Parallel and Distributed Systems, 2019.

[JPDC] T. Zhang, W. Kan, X.-Y. Liu*. High-performance GPU primitives for graph-tensor learning operations. Elsevier Journal of Parallel and Distributed Computing, 2021.
[HPCC] H. Li, T. Zhang, R. Zhang, X.-Y. Liu. High-performance tensor decoder on GPUs for wireless camera networks in IoT. IEEE HPCC 2019.
[HPCC] H. Lu, T. Zhang, X.-Y. Liu. High-performance homomorphic matrix completion on GPUs. IEEE HPCC 2019.
[ICASSP] X.-Y. Liu, T. Zhang (co-primary author). cuTensor-tubal: Optimized GPU library for low-tubal-rank tensors. IEEE ICASSP, 2019.
[NeurIPS Workshop] X.-Y. Liu, T. Zhang, H. Hong, H. Huang, H. Lu. High performance computing primitives for tensor networks learning operations on GPUs. Quantum Tensor Networks in Machine Learning Workshop at NeurIPS 2020.
[NeurIPS Workshop] X.-Y. Liu, Q. Zhao, J. Biamonte, A. Walid. Tensor, Tensor Networks, Quantum Tensor Networks in Machine Learning: An Hourglass Architecture. Quantum Tensor Networks in Machine Learning Workshop at NeurIPS 2020.
[NeurIPS Workshop] H. Hong, H. Huang, T. Zhang, X.-Y. Liu. High Performance single-site finite DMRG on GPUs. Quantum Tensor Networks in Machine Learning Workshop at NeurIPS 2020.
[IJCAI Workshop] X.-Y. Liu, H. Lu, T. Zhang. cuTensor-CP: High performance third-order CP tensor decompositions on GPUs. IJCAI 2020 Workshop on Tensor Network Representations in Machine Learning, 2020.
[ICCAD] C. Deng, M. Yin, X.-Y. Liu, X. Wang, B. Yuan. High-performance hardware architecture for tensor singular value decomposition (Invited paper). International Conference on Computer-Aided Design (ICCAD), 2019.
[IPCCC] J. Huang, L. Kong, X.-Y. Liu, W. Qu and G. Chen. A C++ library for tensor decomposition. International Performance Computing and Communications Conference (IPCCC), 2019.
W. Xu and X.-Y. Liu. Classical simulation of Sycamore quantum supremacy circuits using GPU tensor cores: A tensor network approach. ACM Student Research Competition at IEEE ICCAD 2022.

Reaching for the top of AI science!

We are a young team specializing in GPU tensor computing and deep learning, committed to creating top AI algorithms and solutions for corporations, labs, schools, and communities.