Preface

This post records my notes from studying nanoGPT. References:

GPT in 60 Lines of NumPy: https://jaykmody.com/blog/gpt-from-scratch/
GPT in 60 lines of code (Chinese translation of the post above): https://zhuanlan.zhihu.com/p/679330102
nanoGPT in practice: https://zhuanlan.zhihu.com/p/716442447
nanoGPT code walkthrough: https://zhuanlan.zhihu.com/p/677407971

How GPT works

GPT (Generative Pre-trained Transformer) is a language model built on the Transformer decoder: it autoregressively predicts the next token. In pseudocode, GPT can be written as:

def gpt(inputs: list[int]) -> list[list[float]]:
    """GPT forward pass: predict the next token at every position.

    inputs: list[int] of shape [n_seq], token ids of the input text
    output: list[list[float]] of shape [n_seq, n_vocab], predicted logits
    """
    output = ...  # placeholder: the GPT internals go here
    return output

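That is, given a sequence of token ids as input, GPT returns one row of scores over the vocabulary for each input position.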

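About tokens

A token can be thought of as the smallest unit a sentence is split into. It is often a word, but the split can be simplified in some cases; for example, when training on the collected works of Shakespeare later on, each character is treated as a token.

Tokens are produced by a tokenizer, which is paired with a vocabulary. What the model actually receives is a sequence of integers, each giving the position of a token in the vocabulary. For example: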

# The index of a token in the vocabulary is its integer id
# e.g. the id of "robot" is 1, because vocab[1] = "robot"
vocab = ["must", "robot", "obey", "the", "orders", "."]

# The tokenizer (a hypothetical one that splits on whitespace)
tokenizer = WhitespaceTokenizer(vocab)

# encode() converts a str into a list[int]
ids = tokenizer.encode("robot must obey orders") # ids = [1, 0, 2, 4]

# Mapping the ids back through the vocabulary shows the actual tokens
tokens = [tokenizer.vocab[i] for i in ids] # tokens = ["robot", "must", "obey", "orders"]

# decode() converts a list[int] back into a str
text = tokenizer.decode(ids) # text = "robot must obey orders"
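
The snippet above relies on a WhitespaceTokenizer that is never defined. A minimal sketch of what such a class could look like follows; the class and its token_to_id attribute are assumptions made for illustration, not part of nanoGPT or of any library.

```python
class WhitespaceTokenizer:
    """Toy tokenizer (hypothetical): splits on whitespace and looks ids up in a fixed vocab."""

    def __init__(self, vocab: list[str]):
        self.vocab = vocab
        # Reverse lookup: token string -> integer id
        self.token_to_id = {token: i for i, token in enumerate(vocab)}

    def encode(self, text: str) -> list[int]:
        # Raises KeyError for any word that is not in the vocabulary
        return [self.token_to_id[token] for token in text.split()]

    def decode(self, ids: list[int]) -> str:
        return " ".join(self.vocab[i] for i in ids)


vocab = ["must", "robot", "obey", "the", "orders", "."]
tokenizer = WhitespaceTokenizer(vocab)
ids = tokenizer.encode("robot must obey orders")
print(ids)                    # [1, 0, 2, 4]
print(tokenizer.decode(ids))  # robot must obey orders
```

Real tokenizers also handle unknown words and punctuation attached to words, which this sketch deliberately ignores.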

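The output is likewise a two-dimensional array: output[i][j] is the model's score for vocab[j] being the next token after position i, that is, the token that follows inputs[i], given inputs[0] through inputs[i]. Strictly speaking these scores are unnormalized logits rather than probabilities; the values below are written as probabilities only for readability. For example: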

import numpy as np

inputs = [1, 0, 2]  # "robot" "must" "obey"
vocab = ["must", "robot", "obey", "the", "orders", "."]
output = gpt(inputs)

# output[0] = [0.75, 0.1, 0.15, 0.0, 0.0, 0.0]
# given "robot", the model rates "must" as the most likely next token

# output[1] = [0.0, 0.0, 0.8, 0.1, 0.0, 0.1]
# given the sequence ["robot", "must"], the model rates "obey" as most likely

# output[-1] = [0.0, 0.0, 0.1, 0.0, 0.85, 0.05]
# given the whole sequence ["robot", "must", "obey"], the model rates "orders" as most likely
next_token_id = np.argmax(output[-1])  # next_token_id = 4
next_token = vocab[next_token_id]      # next_token = "orders"
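
Since the model returns logits rather than probabilities, a softmax is needed to read them as a distribution. The sketch below uses made-up logits for the last position (the numbers are assumptions for illustration only) and shows that softmax does not change which token has the highest score.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    # Subtracting the row-wise max keeps exp() from overflowing; the result is unchanged
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


vocab = ["must", "robot", "obey", "the", "orders", "."]
# Made-up logits for the last position of ["robot", "must", "obey"]
last_logits = np.array([0.2, -1.0, 1.1, -0.5, 3.0, 0.4])

probs = softmax(last_logits)           # non-negative, sums to 1
next_token_id = int(np.argmax(probs))  # same index as np.argmax(last_logits)
print(vocab[next_token_id])            # orders
```

For greedy decoding the softmax can therefore be skipped; it only matters when sampling from the distribution or computing a loss.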

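At inference time (text generation), the prompt is fed to GPT first; the token predicted in each round is then appended to the end of the input, and the process repeats. For example: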

import numpy as np

def generate(inputs, n_tokens_to_generate):
    """Autoregressive generation with GPT.

    inputs: list[int], token ids of the input text (extended in place)
    n_tokens_to_generate: int, number of tokens to generate
    """
    # autoregressive decoding loop
    for _ in range(n_tokens_to_generate):
        output = gpt(inputs)             # forward pass: logits of shape [n_seq, n_vocab]
        next_id = np.argmax(output[-1])  # greedy decoding: best-scoring token at the last position
        inputs.append(int(next_id))      # append the prediction to the input
    return inputs[len(inputs) - n_tokens_to_generate :]  # return only the generated ids

# A quick example
input_ids = [1, 0, 2]                          # ["robot", "must", "obey"]
output_ids = generate(input_ids, 1)            # output_ids = [4]; input_ids is now [1, 0, 2, 4]
output_tokens = [vocab[i] for i in output_ids] # ["orders"]
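
To run the decoding loop without a trained model, gpt can be replaced by a stand-in. The sketch below uses a hypothetical toy_gpt that looks only at each token itself and consults a hand-written next-token table (NEXT_ID, an assumption for illustration); it is not a language model, only a way to exercise the loop. This variant also copies the prompt so the caller's list is left untouched.

```python
import numpy as np

vocab = ["must", "robot", "obey", "the", "orders", "."]

# Hypothetical stand-in for a trained model: a fixed "next token" table
NEXT_ID = {1: 0, 0: 2, 2: 4, 4: 5, 5: 5, 3: 4}


def toy_gpt(inputs: list[int]) -> np.ndarray:
    """Return logits of shape [n_seq, n_vocab] based only on each token itself."""
    logits = np.zeros((len(inputs), len(vocab)))
    for position, token_id in enumerate(inputs):
        logits[position, NEXT_ID[token_id]] = 1.0
    return logits


def generate(inputs: list[int], n_tokens_to_generate: int) -> list[int]:
    inputs = list(inputs)  # copy so the caller's prompt is not modified
    for _ in range(n_tokens_to_generate):
        logits = toy_gpt(inputs)
        next_id = int(np.argmax(logits[-1]))  # greedy decoding
        inputs.append(next_id)
    return inputs[len(inputs) - n_tokens_to_generate:]


prompt = [1, 0, 2]  # "robot" "must" "obey"
new_ids = generate(prompt, 2)
print([vocab[i] for i in new_ids])  # ['orders', '.']
```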

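Code

The outermost entry point is the GPT class. It is called as follows:

logits, loss = model(X, Y)

Here X and Y are the inputs and their corresponding labels. Note that both are already integer arrays (each entry is the position of a token in the vocabulary).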
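
A minimal sketch of how such a pair can be built, assuming the usual next-token setup in which the labels are the inputs shifted one position to the left. The corpus, block_size, and start offsets below are made up for illustration.

```python
import numpy as np

block_size = 4  # assumed context length for this toy example
data = np.array([1, 0, 2, 4, 5, 1, 0, 2, 4, 5])  # token ids of a tiny made-up corpus

# Two arbitrary start offsets; a training script would typically sample these at random
starts = [0, 3]
X = np.stack([data[i : i + block_size] for i in starts])          # inputs,  shape (2, 4)
Y = np.stack([data[i + 1 : i + 1 + block_size] for i in starts])  # targets, shape (2, 4)

print(X.tolist())  # [[1, 0, 2, 4], [4, 5, 1, 0]]
print(Y.tolist())  # [[0, 2, 4, 5], [5, 1, 0, 2]]
```

Each Y[b, t] is the token that follows X[b, t] in the corpus, so a single forward pass yields a training signal at every position.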

The GPT class

Overall structure

def forward(self, idx, targets=None):
    device = idx.device
    b, t = idx.size()
    assert t <= self.config.block_size, f"Cannot forward sequence of length {t}, block size is only {self.config.block_size}"
    pos = torch.arange(0, t, dtype=torch.long, device=device)
    tok_emb = self.transformer.wte(idx) 
    pos_emb = self.transformer.wpe(pos) 
    x = self.transformer.drop(tok_emb + pos_emb)
    for block in self.transformer.h:
        x = block(x)
    x = self.transformer.ln_f(x)
    if targets is not None:
        logits = self.lm_head(x)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
    else:
        logits = self.lm_head(x[:, [-1], :]) # note: using list [-1] to preserve the time dim
        loss = None
    return logits, loss

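The input idx is an integer tensor of shape (b, t) holding the vocabulary index of each input token, where b is the batch dimension. The assertion requires the sequence length t to be at most block_size, the maximum length the model can process.
A position tensor pos (0 to t-1) is then created and used to look up the position embeddings.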

Next comes the core component, self.transformer, defined as follows:

self.transformer = nn.ModuleDict(dict(
    wte = nn.Embedding(config.vocab_size, config.n_embd),
    wpe = nn.Embedding(config.block_size, config.n_embd),
    drop = nn.Dropout(config.dropout),
    h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
    ln_f = LayerNorm(config.n_embd, bias=config.bias),
))

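It has five parts: token embeddings (word token embedding, wte), position embeddings (word position embedding, wpe), dropout (drop), a list of Transformer blocks (h), and a final layer norm (ln_f).
The overall flow is:
- compute the token and position embeddings, add them, and apply dropout
- pass the result through each Transformer block in turn
- apply the final layer norm
- apply the final projection, lm_head, which maps the hidden states to the vocabulary dimension
- if targets are given, compute logits for every position and the cross-entropy loss; otherwise compute logits for the last position only and return no loss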

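Token and position embeddings

Token embedding: wte is a learnable parameter matrix of shape [n_vocab, n_embd]. It acts as a token embedding lookup table in which row i is the embedding of the i-th token in the vocabulary.
- wte[idx] indexes the table with the list of token ids to retrieve the vector for each input token.
Position embedding: wpe encodes the order of the sequence. It is likewise a learnable parameter matrix, of shape [block_size, n_embd], with one row per position.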
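
The lookup and the addition can be sketched in NumPy as below. The tables are filled with random values and the sizes are toy assumptions; in the model both tables are learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, block_size, n_embd = 6, 8, 4  # toy sizes, chosen only for illustration

wte = rng.normal(size=(vocab_size, n_embd))  # token embedding table
wpe = rng.normal(size=(block_size, n_embd))  # position embedding table

idx = np.array([[1, 0, 2], [2, 4, 5]])  # (b, t) token ids
b, t = idx.shape
pos = np.arange(t)                       # (t,) positions 0..t-1

tok_emb = wte[idx]     # (b, t, n_embd): one row of wte per token id
pos_emb = wpe[pos]     # (t, n_embd): one row of wpe per position
x = tok_emb + pos_emb  # pos_emb broadcasts over the batch dimension
print(x.shape)         # (2, 3, 4)
```

Because pos_emb has no batch dimension, the same position vectors are added to every sequence in the batch.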

The Block class

The Block class is implemented as follows:

class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = LayerNorm(config.n_embd, bias=config.bias)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = LayerNorm(config.n_embd, bias=config.bias)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x

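It consists of two layer norms, a causal self-attention layer, and an MLP. Each sub-layer is applied to the layer-normalized input, and its output is added back to x (a residual connection).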
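
The normalize, transform, add-back pattern can be sketched in NumPy as follows. The layer_norm here omits the learnable scale and bias, and the attn and mlp arguments are placeholder callables, both simplifications assumed for illustration.

```python
import numpy as np


def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    # Normalize over the last (feature) dimension; learnable scale and bias omitted
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def residual_block(x, attn, mlp):
    # Same two steps as Block.forward: normalize, transform, add back to the input
    x = x + attn(layer_norm(x))
    x = x + mlp(layer_norm(x))
    return x


rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 4))  # (batch, sequence, n_embd)

# With sub-layers that output zeros, the residual path passes x through unchanged
zeros = lambda h: np.zeros_like(h)
out = residual_block(x, attn=zeros, mlp=zeros)
print(np.allclose(out, x))  # True
```

The check at the end illustrates why residual connections help: each block only has to learn a correction on top of the identity mapping.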

The CausalSelfAttention class

This is the core class that implements the attention mechanism.

def forward(self, x):
    B, T, C = x.size() # batch size, sequence length, embedding dimensionality (n_embd)

    # calculate query, key, values for all heads in batch and move head forward to be the batch dim
    q, k, v  = self.c_attn(x).split(self.n_embd, dim=2)
    k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
    q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
    v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)

    # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
    if self.flash:
        # efficient attention using Flash Attention CUDA kernels
        y = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=None, dropout_p=self.dropout if self.training else 0, is_causal=True)
    else:
        # manual implementation of attention
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        att = self.attn_dropout(att)
        y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
    y = y.transpose(1, 2).contiguous().view(B, T, C) # re-assemble all head outputs side by side

    # output projection
    y = self.resid_dropout(self.c_proj(y))
    return y

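self.c_attn(x) is the input projection of the attention layer. It fuses the Q, K, and V projections into a single linear layer, so all three are computed in one matrix multiplication instead of three; the result is then split into q, k, and v.
self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
Next, q, k, and v are each reshaped into n_head heads, and then attention is computed.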
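
A NumPy sketch of the fused projection and the head split is given below. It uses random weights, omits biases, and picks toy sizes, all assumptions for illustration; the point is only that one (C, 3C) matrix gives the same q, k, v as three separate (C, C) matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
B, T, C, n_head = 2, 3, 8, 2  # toy sizes; C must be divisible by n_head
x = rng.normal(size=(B, T, C))

# Three separate projection matrices (biases omitted) ...
w_q, w_k, w_v = (rng.normal(size=(C, C)) for _ in range(3))
# ... stacked side by side into one fused (C, 3C) matrix
w_qkv = np.concatenate([w_q, w_k, w_v], axis=1)

qkv = x @ w_qkv                      # (B, T, 3C), a single matmul
q, k, v = np.split(qkv, 3, axis=-1)  # three (B, T, C) arrays

# Split C into n_head heads of size hs and move the head axis forward
hs = C // n_head
q_heads = q.reshape(B, T, n_head, hs).transpose(0, 2, 1, 3)  # (B, nh, T, hs)

print(np.allclose(q, x @ w_q))  # True
print(q_heads.shape)            # (2, 2, 3, 4)
```

Head h of q_heads holds columns h*hs to (h+1)*hs of q, so splitting into heads is a reshape and involves no extra computation.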

Note that to make the attention causal, meaning each token can attend only to itself and the tokens before it, part of the computed att matrix has to be masked.

# The input is ["not", "all", "heroes", "wear", "capes"]

# Plain self-attention
        not    all   heroes  wear  capes
   not 0.116  0.159  0.055  0.226  0.443
   all 0.180  0.397  0.142  0.106  0.175
heroes 0.156  0.453  0.028  0.129  0.234
  wear 0.499  0.055  0.133  0.017  0.295
 capes 0.089  0.290  0.240  0.228  0.153

 # Causal self-attention (rows are queries i, columns are keys j)
 # To stop a query from looking at future tokens, every entry with j > i must be 0:
        not    all   heroes  wear  capes
   not 0.116  0.     0.     0.     0.
   all 0.180  0.397  0.     0.     0.
heroes 0.156  0.453  0.028  0.     0.
  wear 0.499  0.055  0.133  0.017  0.
 capes 0.089  0.290  0.240  0.228  0.153

 # Zeroing entries after softmax would leave rows that no longer sum to 1,
 # so the attention scores are modified before softmax instead: the scores to be
 # masked are set to -inf, which gives them a weight of 0 after normalization.
 # The mask matrix, added to the scores:
 0 -1e10 -1e10 -1e10 -1e10
 0   0   -1e10 -1e10 -1e10
 0   0     0   -1e10 -1e10
 0   0     0     0   -1e10
 0   0     0     0     0

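 The NumPy version uses -1e10 rather than -np.inf because -np.inf can lead to NaNs. The nanoGPT code above fills with float('-inf') instead, which is safe here because every row keeps at least one unmasked entry (the diagonal).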
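
The masking step can be checked with a small NumPy sketch. The q and k values are random and the sizes are toy assumptions; a single head is used and there is no batch dimension.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max(axis=-1, keepdims=True)  # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
T, hs = 5, 4  # sequence length and head size
q = rng.normal(size=(T, hs))
k = rng.normal(size=(T, hs))

scores = q @ k.T / np.sqrt(hs)                      # (T, T) raw attention scores
future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True where key j > query i

weights = softmax(np.where(future, -1e10, scores))       # large negative fill
weights_inf = softmax(np.where(future, -np.inf, scores))  # -inf fill

print(np.allclose(weights.sum(axis=-1), 1.0))  # True: every row is still a distribution
print(np.all(np.triu(weights, k=1) == 0))      # True: no weight on future tokens
print(np.allclose(weights, weights_inf))       # True: both fills agree here
```

The first row always ends up as [1, 0, 0, 0, 0], since the first token can attend only to itself.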

That completes the breakdown of the GPT architecture.