TL;DR: GPU is better for algorithm exploration; TPU excels at optimization once algorithms are stable.

The Essence of TPU

  1. TPU hardware is essentially one massive array of matrix multiplication units (the MXU, a systolic array). The ML engineer's goal is to keep those units saturated: when the FLOPs are maxed out, performance is excellent; when they are under-filled, problems pile up. A utilization sketch follows this list.
  2. XLA is the core of the TPU software stack. XLA compiles high-level frameworks like TensorFlow and JAX down to TPU hardware instructions; the compilation path is sketched just after the utilization example. However, this whole-graph compilation ecosystem will always lag behind the pace of hand-written GPU kernels.
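
To make the "fill the matrix units" point concrete, here is a minimal utilization sketch. It assumes, as a simplification, that XLA pads every matmul dimension up to the 128-wide MXU tile; the actual tiling rules vary by TPU generation and dtype, and the helper name is hypothetical:

```python
MXU_DIM = 128  # assumed tile edge of the TPU matrix unit (varies by generation)

def mxu_utilization(m: int, k: int, n: int) -> float:
    """Fraction of a padded (m, k) @ (k, n) matmul that is useful work.
    Each dimension is rounded up to the tile edge; the remainder is
    wasted FLOPs spent multiplying padding."""
    def pad(d: int) -> int:
        return -(-d // MXU_DIM) * MXU_DIM  # round up to a multiple of 128
    return (m * k * n) / (pad(m) * pad(k) * pad(n))

print(mxu_utilization(1024, 1024, 1024))  # 1.0   -> matrix units saturated
print(mxu_utilization(1000, 1000, 1000))  # ~0.93 -> padding eats ~7% of FLOPs
```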
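
And a minimal look at the compilation path itself: jax.jit traces the program into a whole graph and hands it to XLA, and lower(...).as_text() shows the intermediate representation that XLA compiles.

```python
import jax
import jax.numpy as jnp

def f(x, w):
    return jnp.dot(x, w)  # becomes an XLA dot op, which targets the MXU

x = jnp.ones((256, 512))
w = jnp.ones((512, 128))

# jax.jit traces f into a whole-program graph; .lower() hands it to XLA.
# The printed (Stable)HLO is the IR that XLA compiles, on a TPU backend,
# into hardware instructions.
print(jax.jit(f).lower(x, w).as_text())
```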

Examples

  1. FlashAttention: FlashAttention is arguably the most successful algorithmic optimization for Transformers. When researchers came up with the idea, they first wrote custom GPU kernels to experiment with it and popularize it. XLA moves at its own development cadence: optimization only begins once FlashAttention is clearly proven useful, hence the obvious "lag." That said, the latest XLA has since optimized FlashAttention to the limit, even outperforming the GPU implementations. The core tiling trick is sketched after this list.

  2. MoE Architecture: The recently popular MoE (Mixture of Experts) structure still has no basic consensus on its most critical pieces, such as how to do token-expert routing (or those who know keep it to themselves). In that situation, most researchers default to GPUs for an algorithm still in early exploration, which is why most MoE papers run their experiments on GPUs. Google is investing heavily in MoE and keeps optimizing it on TPU, but, as said above, the newest ideas still see the "lag." A minimal routing sketch also follows this list.

  3. Transformer Evolution: Going back further, Transformer models could already run on TPU v2/v3; Google's BERT was trained on TPUs. But v2/v3 were optimized for CNNs and other early deep learning workloads, so Transformer engineers had to "accommodate" the TPU, for example by hand-tuning batch size and sequence length, and the common feedback was that it was not as "smooth" as GPUs. With TPU v5, optimized specifically for Transformers and large-model training, LLM training and inference on TPUs became genuinely pleasant. The flip side: researchers working on traditional CNN-based models and multidimensional data now find it painful.
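
As promised, a minimal sketch of the FlashAttention idea in JAX: scan over key/value blocks with an online softmax so the full (seq, seq) attention matrix is never materialized. This illustrates the algorithmic trick only, not any production kernel; the real implementations are hand-written CUDA kernels on GPU or Pallas/XLA kernels on TPU:

```python
import jax
import jax.numpy as jnp

def flash_attention(q, k, v, block=128):
    # Scan over key/value blocks, keeping a running max (m), running
    # softmax denominator (l), and running weighted sum (acc) per query.
    seq, d = q.shape
    scale = d ** -0.5
    kb = k.reshape(-1, block, d)   # (num_blocks, block, d)
    vb = v.reshape(-1, block, d)

    def step(carry, kv):
        m, l, acc = carry
        k_i, v_i = kv
        s = (q @ k_i.T) * scale                    # scores for this block
        m_new = jnp.maximum(m, s.max(axis=-1))     # updated running max
        corr = jnp.exp(m - m_new)                  # rescale the old stats
        p = jnp.exp(s - m_new[:, None])            # block softmax numerator
        l_new = l * corr + p.sum(axis=-1)
        acc_new = acc * corr[:, None] + p @ v_i
        return (m_new, l_new, acc_new), None

    init = (jnp.full((seq,), -jnp.inf),            # running max
            jnp.zeros((seq,)),                     # running denominator
            jnp.zeros((seq, d)))                   # running weighted sum
    (_, l, acc), _ = jax.lax.scan(step, init, (kb, vb))
    return acc / l[:, None]

# Agrees with naive attention without ever forming the (seq, seq) matrix.
q = k = v = jax.random.normal(jax.random.PRNGKey(0), (512, 64))
ref = jax.nn.softmax((q @ k.T) * 64 ** -0.5) @ v
print(jnp.allclose(flash_attention(q, k, v), ref, atol=1e-4))
```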
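
And a minimal sketch of one common (but by no means agreed-upon) token-expert routing scheme, top-2 gating: each token picks its two highest-scoring experts and the gate weights are renormalized. Capacity limits, load-balancing losses, and jitter noise, exactly the parts with no consensus yet, are omitted:

```python
import jax
import jax.numpy as jnp

def top2_routing(tokens, router_w):
    """Route each token to its top-2 experts by router logits.
    Returns per-token expert indices and renormalized gate weights;
    real systems add capacity limits and auxiliary balancing losses."""
    logits = tokens @ router_w                  # (num_tokens, num_experts)
    gates = jax.nn.softmax(logits, axis=-1)
    top_vals, top_idx = jax.lax.top_k(gates, k=2)
    top_vals = top_vals / top_vals.sum(axis=-1, keepdims=True)  # renormalize
    return top_idx, top_vals

key = jax.random.PRNGKey(0)
tokens = jax.random.normal(key, (16, 512))      # 16 tokens, d_model = 512
router_w = jax.random.normal(key, (512, 8))     # router for 8 experts
idx, w = top2_routing(tokens, router_w)
print(idx.shape, w.shape)  # (16, 2) (16, 2)
```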

Debunking Some Myths

  1. “GPU is versatile, TPU can’t do many things”
    • Wrong. As long as you can cast the problem as matrix operations, a TPU can handle it, just possibly not as efficiently as a GPU. Scientific computing workloads such as protein folding, weather prediction, and molecular dynamics all have JAX implementations with good results; a small example of this "cast it as matmuls" trick follows the list.
  2. “New algorithms require TPU redesign”
    • Wrong. In most cases, updating XLA is enough to support a new algorithm, but that effort far exceeds hand-writing a custom CUDA kernel, so new algorithms typically debut on GPUs.
  3. “GPU is in danger, TPU will dominate the market”
    • Wrong. As long as AGI hasn't arrived and researchers are still exploring algorithms, GPUs will always lead. If TPUs ever captured over half the market, that would in effect announce that AGI progress had entered a winter; hopefully that never happens. AMD's Lisa Su predicts an ASIC market share of 20-25%, which is a reasonable number.
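
To illustrate "convert the problem into matrix operations": pairwise distances, the inner loop of many molecular dynamics and protein structure codes, reduce to one dense matmul plus elementwise ops, which is exactly the shape of work the TPU's matrix units want. A minimal sketch:

```python
import jax.numpy as jnp
from jax import random

def pairwise_sq_dists(x):
    """Squared distances between all point pairs via the matmul identity
    ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, so the O(n^2) core is one
    dense x @ x.T that maps straight onto the TPU matrix units."""
    sq = jnp.sum(x * x, axis=-1)                     # (n,) squared norms
    return sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)

atoms = random.normal(random.PRNGKey(0), (1024, 3))  # 1024 atoms in 3D
d2 = pairwise_sq_dists(atoms)
print(d2.shape)  # (1024, 1024)
```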