Tensor Comprehensions: deep learning as a polyhedral compiler's killer app
Deep learning models with convolutional and recurrent networks analyze massive amounts of audio, image, video, text and graph data, with applications to automatic translation, speech-to-text, scene understanding, ranking user preferences, ad placement, etc. Competing frameworks for building these networks, such as TensorFlow, Chainer, CNTK, Torch/PyTorch, Caffe1/2, MXNet and Theano, explore different tradeoffs between usability and expressiveness, research or production orientation, and supported hardware. They operate on a DAG of computational operators, wrapping high-performance libraries such as cuDNN (for NVIDIA GPUs) or NNPACK (for various CPUs), and automate memory allocation, synchronization, and distribution. Custom operators are needed where the computation does not fit existing high-performance library calls, usually at a high engineering cost. Such operators suffer a severe performance penalty, which limits the pace of innovation. Furthermore, existing library primitives often do not offer optimal performance in a particular network architecture, because they miss optimizations across operators as well as specialization to the size and shape of the data.
We will survey the work-in-progress design of:
(1) a language close to the mathematics of deep learning called Tensor Comprehensions, featuring interesting developments in the areas of automatic range inference, declarative array programming, and data-flow modeling of recurrent networks;
(2) a polyhedral Just-In-Time compiler to convert a mathematical description of a deep learning DAG into a CUDA kernel with delegated memory management and synchronization, also providing optimizations such as operator fusion and specialization for specific sizes;
(3) a high-level metaprogramming environment and compilation cache populated by an autotuner, acting as a built-to-order library.
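To make the idea of automatic range inference in item (1) more concrete, here is a minimal Python sketch. It is not the Tensor Comprehensions syntax or implementation; the helper names (infer_ranges, shape_of, comprehension) and the single-letter index convention are assumptions made only for illustration. The sketch derives the extent of each index from the shapes of the input tensors, then evaluates a sum-reduction over every index that does not appear on the output side, which is one plausible reading of a declarative statement such as C(m, n) += A(m, k) * B(k, n).

```python
from itertools import product


def shape_of(nested):
    """Shape of a non-empty, rectangular nested list."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0]
    return tuple(shape)


def infer_ranges(operands):
    """Map each index name to its extent, taken from the operand shapes.

    operands: list of (index_names, shape) pairs, e.g. ("mk", (2, 3)).
    Raises ValueError if two uses of the same index disagree.
    """
    ranges = {}
    for names, shape in operands:
        if len(names) != len(shape):
            raise ValueError("rank mismatch for access " + repr(names))
        for name, extent in zip(names, shape):
            if ranges.setdefault(name, extent) != extent:
                raise ValueError("conflicting extents for index " + repr(name))
    return ranges


def element(nested, idx):
    """Read one element of a nested list at a multi-dimensional index."""
    for i in idx:
        nested = nested[i]
    return nested


def comprehension(out_names, inputs):
    """Evaluate out(out_names) += product of inputs.

    Indices absent from out_names are treated as reduction indices.
    inputs: list of (index_names, nested_list) pairs.
    Returns the inferred ranges and a dict from output index tuple to value.
    """
    ranges = infer_ranges([(names, shape_of(t)) for names, t in inputs])
    reduce_names = [n for n in ranges if n not in out_names]
    out = {}
    for o in product(*(range(ranges[n]) for n in out_names)):
        env = dict(zip(out_names, o))
        total = 0
        for r in product(*(range(ranges[n]) for n in reduce_names)):
            env.update(zip(reduce_names, r))
            term = 1
            for names, t in inputs:
                term *= element(t, [env[n] for n in names])
            total += term
        out[o] = total
    return ranges, out


# Hypothetical example: a 2x3 matrix times a 3x2 matrix.
A = [[1, 2, 3], [4, 5, 6]]
B = [[1, 0], [0, 1], [1, 1]]
ranges, C = comprehension("mn", [("mk", A), ("kn", B)])
```

No loop bounds are written by the user here: the extents of m, k and n all come from the operand shapes, and k becomes a reduction index only because it is absent from the output. A real compiler would lower such a statement to optimized loops rather than interpret it as this sketch does.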
Our first results demonstrate the suitability of the polyhedral framework to construct a fully automatic, domain-specific optimizer, effective on state-of-the-art deep learning models and targeting NVIDIA GPUs. Our compilation flow reaches up to 4x speedup over NVIDIA libraries on kernels relevant to the machine learning community, and on an actual model used in production at Facebook. Tensor Comprehensions (TC) also facilitates algorithmic exploration, with speedups of up to two orders of magnitude on research layers. It is open source and integrated with the mainstream frameworks Caffe2 (production-oriented) and PyTorch (research-oriented). TC is still at an early stage, and we are looking for contributions and collaboration.