TCP: A Tensor Contraction Processor for AI Workloads

Hanjoon Kim, Younggeun Choi, Junyoung Park, Byeongwook Bae, Hyunmin Jeong, Sang Min Lee, Jeseung Yeon, Minho Kim, Changjae Park, Boncheol Gu, Changman Lee, Jaeick Bae, SungGyeong Bae, Yojung Cha, Wooyoung Choe, Jonguk Choi, Juho Ha, Hyuck Han, Namoh Hwang, Seokha Hwang, Kiseok Jang, Haechan Je, Hojin Jeon, Jaewoo Jeon, Hyunjun Jeong, Yeonsu Jung, Dongok Kang, Hyewon Kim, Minjae Kim, Muhwan Kim, Sewon Kim, Suhyung Kim, Won Kim, Yong Kim, Youngsik Kim, Younki Ku, Jeong Ki Lee, Juyun Lee, Kyungjae Lee, Seokho Lee, Minwoo Noh, Hyuntaek Oh, Gyunghee Park, Sanguk Park, Jimin Seo, Jungyoung Seong, June Paik, Nuno P. Lopes, Sungjoo Yoo

 

Abstract:

We introduce a novel tensor contraction processor (TCP) architecture that offers a paradigm shift from traditional architectures that rely on fixed-size matrix multiplications. TCP aims at exploiting the rich parallelism and data locality inherent in tensor contractions, thereby enhancing both efficiency and performance of AI workloads.
TCP is composed of coarse-grained processing elements (PEs) to simplify software development. In order to efficiently process operations with diverse tensor shapes, the PEs are designed to be flexible enough to be utilized as a large-scale single unit or a set of small independent compute units.
We aim at maximizing data reuse on both levels of inter and intra compute units. To do that, we propose a circuit switch-based fetch network to flexibly connect compute units to enable inter-compute unit data reuse. We also exploit input broadcast to multiple contraction engines and input buffer based reuse to further exploit reuse behavior in tensor contraction. Our compiler explores the design space of tensor contractions considering tensor shapes and the order of their associated loop operations as well as the underlying accelerator architecture.
A TCP chip was designed and fabricated in 5nm technology as the second-generation product of Furiosa AI, offering 256/512/1024 TOPS (BF16/FP8 or INT8/INT4) with 256 MB SRAM and 1.5 TB/s 48 GB HBM3 under 150 W TDP. Commercialization will start in August 2024.
We performed an extensive case study of running the LLaMA-2 7B model and evaluated its performance and power efficiency on various configurations of sequence length and batch size. For this model, TCP is 2.7x and 4.1x better than H100 and L40s, respectively, in terms of performance per watt.

 

Published:

H. Kim, Y. Choi, J. Park, B. Bae, H. Jeong, S. M. Lee, J. Yeon, M. Kim, C. Park, B. Gu, C. Lee, J. Bae, S. Bae, Y. Cha, W. Choe, J. Choi, J. Ha, H. Han, N. Hwang, S. Hwang, K. Jang, H. Je, H. Jeon, J. Jeon, H. Jeong, Y. Jung, D. Kang, H. Kim, M. Kim, M. Kim, S. Kim, S. Kim, W. Kim, Y. Kim, Y. Kim, Y. Ku, J. K. Lee, J. Lee, K. Lee, S. Lee, M. Noh, H. Oh, G. Park, S. Park, J. Seo, J. Seong, J. Paik, N. P. Lopes, S. Yoo. TCP: A Tensor Contraction Processor for AI Workloads. In Proc. of the 51st International Symposium on Computer Architecture (ISCA), July 2024.

 

Download:

 

Bibtex:

@inproceedings{tcp-isca24,
  title =	{{TCP}: A Tensor Contraction Processor for {AI} Workloads},
  author =	{Hanjoon Kim and Younggeun Choi and Junyoung Park and Byeongwook Bae and Hyunmin Jeong and Sang Min Lee and Jeseung Yeon and Minho Kim and Changjae Park and Boncheol Gu and Changman Lee and Jaeick Bae and SungGyeong Bae and Yojung Cha and Wooyoung Choe and Jonguk Choi and Juho Ha and Hyuck Han and Namoh Hwang and Seokha Hwang and Kiseok Jang and Haechan Je and Hojin Jeon and Jaewoo Jeon and Hyunjun Jeong and Yeonsu Jung and Dongok Kang and Hyewon Kim and Minjae Kim and Muhwan Kim and Sewon Kim and Suhyung Kim and Won Kim and Yong Kim and Youngsik Kim and Younki Ku and Jeong Ki Lee and Juyun Lee and Kyungjae Lee and Seokho Lee and Minwoo Noh and Hyuntaek Oh and Gyunghee Park and Sanguk Park and Jimin Seo and Jungyoung Seong and June Paik and Nuno P. Lopes and Sungjoo Yoo},
  booktitle =	{Proc. of the 51st International Symposium on Computer Architecture (ISCA)},
  month =	jul,
  year =	2024
}

 

Copyright notice:

© 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

 

<-- Return