Print input / output / grad / loss at every step/epoch when training a Transformers HuggingFace model

【Question title】: Print input / output / grad / loss at every step/epoch when training Transformers HuggingFace model
【Posted】: 2021-10-15 23:17:01
【Question】:

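I am studying HuggingFace Transformers and working from the toy example here: https://huggingface.co/transformers/custom_datasets.html#fine-tuning-with-trainer

What I actually need is the ability to print the inputs, outputs, gradients, and loss at every step. That is trivial with a plain PyTorch training loop, but it is not obvious how to do it with the HuggingFace Trainer. My current idea is to create a custom callback like this: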

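from transformers import Trainer, TrainerCallback, TrainingArguments

class MyCallback(TrainerCallback):
    "A callback that prints a gradient norm at every step"

    def on_step_begin(self, args, state, control, **kwargs):
        print("next step")
        # .grad is None before the first backward pass (and whenever
        # gradients have been reset to None), so guard before calling .norm()
        grad = kwargs['model'].classifier.out_proj.weight.grad
        if grad is not None:
            print(grad.norm())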

args = TrainingArguments(
    output_dir='test_dir',
    overwrite_output_dir=True,
    num_train_epochs=1,
    logging_steps=100,
    report_to="none",
    fp16=True,
    disable_tqdm=True,
)


trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    callbacks=[MyCallback],
)

trainer.train()

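This way I can print the gradients and weights of any model layer. But I still cannot figure out how to print the inputs/outputs (for example, I want to check them for NaN) and the loss.

P.S. I have also read a bit about forward_hook, but I still cannot find a suitable code example.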

【Tags】: python logging neural-network pytorch huggingface-transformers

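【Solution 1】:

Although hooks and custom callbacks are the right way to solve this problem, I found a better solution: a built-in utility that looks for NaN/Inf in the loss/weights/inputs/outputs: https://huggingface.co/transformers/internal/trainer_utils.html#transformers.debug_utils.DebugUnderflowOverflow. The transformers library has offered this option since version 4.6.0.

You can use it manually in the forward function, or simply pass an additional option to TrainingArguments, like this: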

args = TrainingArguments(
    output_dir='test_dir',
    overwrite_output_dir=True,
    num_train_epochs=1,
    logging_steps=100,
    report_to="none",
    fp16=True,
    disable_tqdm=True,
    debug="underflow_overflow"
)
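
For the hook-based route mentioned at the start of this answer, a minimal sketch in plain PyTorch might look like the following. The helper name `attach_nan_hooks` and its `verbose` flag are made up for illustration and are not part of transformers; the only library call relied on is `torch.nn.Module.register_forward_hook`. It assumes you only care about tensors passed positionally to each submodule, since a standard forward hook does not receive keyword arguments.

```python
import torch
from torch import nn


def attach_nan_hooks(model: nn.Module, verbose: bool = False):
    """Register a forward hook on every submodule.

    Returns (handles, problems): the hook handles, and a list that is
    filled with (module_name, "input" | "output") on non-finite values.
    """
    problems = []
    handles = []

    def tensors_in(obj):
        # Flatten tensors, tuples/lists and dicts of tensors into one list.
        if isinstance(obj, torch.Tensor):
            return [obj]
        if isinstance(obj, (list, tuple)):
            return [t for item in obj for t in tensors_in(item)]
        if isinstance(obj, dict):
            return [t for item in obj.values() for t in tensors_in(item)]
        return []

    def make_hook(name):
        def hook(module, inputs, output):
            for kind, value in (("input", inputs), ("output", output)):
                for t in tensors_in(value):
                    if verbose:
                        print(f"{name} {kind}: shape={tuple(t.shape)}")
                    # Only floating-point tensors can hold NaN/Inf.
                    if t.is_floating_point() and not torch.isfinite(t).all():
                        problems.append((name, kind))
                        print(f"non-finite values in {kind} of {name}")
        return hook

    for name, module in model.named_modules():
        handle = module.register_forward_hook(make_hook(name or "<root>"))
        handles.append(handle)
    return handles, problems
```

You would call `handles, problems = attach_nan_hooks(model)` before `trainer.train()` and remove the hooks afterwards with `for h in handles: h.remove()`. Hooking every submodule is noisy and slows training, so in practice it is probably better to hook only the layers you suspect.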
