Basic structure

Transformer block breakdown

Basic parameters

  • N: total number of transformer blocks

  • d_model: number of units in each bottleneck layer, and the width of each Q/K/V input

  • h: number of attention heads in each transformer block

  • seq_len: input sequence length
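A minimal shape-level sketch of how these four parameters interact, using the BERT_Base values from the table below (the symbol names and the plain-NumPy formulation are illustrative assumptions, not any particular implementation):

```python
import numpy as np

# Basic parameters (BERT_Base values from the table below)
N = 12          # total number of transformer blocks
d_model = 768   # width of each bottleneck layer / each Q, K, V input
h = 12          # number of attention heads per block
seq_len = 128   # input sequence length

d_head = d_model // h  # per-head width, derived from d_model and h

x = np.random.randn(seq_len, d_model)    # one input sequence
W_q = np.random.randn(d_model, d_model)  # Q projection of one block
Q = x @ W_q                              # (seq_len, d_model)
Q_heads = Q.reshape(seq_len, h, d_head)  # split the width across h heads

print(Q.shape)        # (128, 768)
print(Q_heads.shape)  # (128, 12, 64)
```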

Derived parameters
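In the standard Transformer design, the commonly derived quantities follow directly from the basic parameters; a short sketch (the labels d_head and d_ff, and the 4x feed-forward ratio, are assumptions taken from the original Transformer design):

```python
d_model, h = 768, 12  # BERT_Base values from the table below

d_head = d_model // h  # per-head Q/K/V width
d_ff = 4 * d_model     # feed-forward inner width (4x ratio, an assumption)

attn_params = 4 * d_model * d_model  # W_Q, W_K, W_V, W_O of one block
ffn_params = 2 * d_model * d_ff      # the two FFN projection matrices

print(d_head, d_ff)                  # 64 3072
print(attn_params + ffn_params)      # weights per block, biases ignored
```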

A detailed diagram of where each parameter appears inside the transformer block is shown below:

[Figure: Transformer block breakdown]

Zooming in on the Feed Forward sub-module

[Figure: Feed Forward sub-module breakdown]
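The Feed Forward sub-module can be sketched as two position-wise linear layers with a nonlinearity in between (a minimal NumPy sketch; ReLU is used here as in the original Transformer, while BERT swaps in GELU, and the 4x inner width is an assumption):

```python
import numpy as np

d_model, d_ff = 768, 3072  # BERT_Base width and the usual 4x inner width

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise FFN: expand each token vector from d_model to d_ff,
    # apply the nonlinearity, then project back down to d_model.
    hidden = np.maximum(x @ W1 + b1, 0.0)  # ReLU
    return hidden @ W2 + b2

x = np.random.randn(128, d_model)  # (seq_len, d_model)
W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)
print(feed_forward(x, W1, b1, W2, b2).shape)  # (128, 768)
```

Note that the sub-module leaves the sequence shape unchanged: only the per-token width expands and contracts.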

Basic parameters of typical models

Application | Model      | N  | d_model   | h  | seq_len
NLP         | GPT-3      | 96 | 12288     | 96 | 2048
NLP         | BERT_Base  | 12 | 768       | 12 | 128/512
NLP         | BERT_Large | 24 | 1024      | 16 | 128/512
RecSys      | BST        | 1  | 128 (max) | 8  | 20
  • BST: Behavior Sequence Transformer
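As a sanity check on the GPT-3 row, the well-known ~175B parameter count can be roughly reproduced from N and d_model alone (the vocabulary size 50257 and the 4x FFN ratio are assumptions taken from the GPT-3/GPT-2 papers; biases, LayerNorm, and position embeddings are ignored):

```python
d_model, n_blocks, vocab = 12288, 96, 50257  # GPT-3 values

attn = 4 * d_model ** 2            # Q, K, V and output projections
ffn = 2 * d_model * (4 * d_model)  # two FFN matrices with inner width 4*d_model
per_block = attn + ffn             # = 12 * d_model^2 weights per block

total = n_blocks * per_block + vocab * d_model  # plus token embeddings
print(f"{total / 1e9:.0f}B")  # prints 175B
```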

References

  1. The GPT-3 Architecture, on a Napkin

  2. GPT-3 An Overview

  3. Language Models are Few-Shot Learners

  4. Improving Language Understanding by Generative Pre-Training

  5. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

  6. Attention Is All You Need

  7. BERT transformer block code

  8. Deep Learning Recommendation Model for Personalization and Recommendation Systems

  9. Behavior Sequence Transformer for E-commerce Recommendation in Alibaba
