Title: A Tensor Operator Compiler for Python
Date: Wednesday, May 10, 2023
Time: 1:30pm - 3:00pm EST
Location: 3402 Klaus, Virtual: https://gatech.zoom.us/my/tzhou80?pwd=b3EzUGRiWGY5Y2VkamZtdVJ2WTh5UT09
Tong Zhou
Ph.D. Computer Science
School of Computer Science
Georgia Institute of Technology
Committee:
Dr. Vivek Sarkar (Advisor) - School of Computer Science, Georgia Institute of Technology
Dr. Jun Shirako - School of Computer Science, Georgia Institute of Technology
Dr. Santosh Pande - School of Computer Science, Georgia Institute of Technology
Dr. Richard Vuduc - School of Computational Science, Georgia Institute of Technology
Abstract:
Kernels in machine learning, data science and scientific computing applications are often expressed as a sequence of matrix operations. These programs typically consist of a combination of element-wise (point-wise) operators such as addition and multiplication; reduction operators, such as summation; and, matrix multiplications (matmuls). Moreover, dimension broadcasting is often required, e.g., when a 2D matrix is added to a 1D vector. A common approach to implementing such kernels is to create a separate hand-optimized library implementation for each operator. Such hand-coded implementations are often backed by vendor libraries, such as cuBLAS and Intel MKL to achieve high performance on targeted hardware. One downside of this approach is that when calling a library function for each operator, intermediate results must be stored and subsequently loaded by the next operator. This overhead becomes increasingly significant for operators with low arithmetic intensity, such as point-wise reduction operators, and tall and skinny matrices, which also exhibit lower arithmetic intensities.
To enhance overall performance in these cases, a series of operators can be computed in a fused fashion (known as operator fusion, a concept akin to loop fusion) so as to reduce or eliminate loads and stores of intermediate results. However, such fused multi-operator kernels must be either created manually by performance experts, or generated automatically by a compiler. While it is possible for performance experts to manually create fused kernels for common patterns for specific hardware platforms, it becomes impractical for them to anticipate and cover all desired combinations of operator sequences and target platforms. Therefore, using a compiler-based approach to automatically generate fused kernels is a more viable and cost-effective approach. However, past compiler based efforts have either only focused on a few hand-picked application-specific combinations of operators or were too general to generate performance competitive fused kernels.
In this work, we advance the state-of-the-art of compiler-based operator fusion in several aspects. 1) We introduce an intermediate presentation: tiled index tree on which several transformations can be easily applied such as fusion, interchanging and tiling. 2) We broaden the operator fusion scope to fuse arbitrary element-wise, reduction and matmul operators based on automatic tiled index tree transformations. 3) We leverage operator-level information and propose dependence-breaking transformations to fuse otherwise unfusable operator patterns, so as to enable single-pass incremental fused computation of special patterns. 4) We extend the fusion scope to also fuse sparse operators, identify four common types of redundancies that can appear when fusing sparse and dense operators and introduce techniques to eliminate them. 5) We deal with the presence of explicit loops, and propose techniques such as loop invariant code motion, memory reusing to further improve the performance. 6) We implement the techniques in a prototype Python compiler framework -- Intrepydd and perform evaluation on both CPU and GPU using a set of representative tensor operator sequences/programs on representative input shapes. As of now, we have some preliminary implementations and results. In the proposed work, we plan to finish the implementation and evaluation.