How to build a dataset for language modeling with the datasets library, as with the old TextDataset from the transformers library

【Question title】: How to build a dataset for language modeling with the datasets library as with the old TextDataset from the transformers library
【Posted】: 2026-02-04 02:40:01
【Question description】:

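I am trying to load a custom dataset that I will then use for language modeling. The dataset consists of a single text file with one whole document per line, which means that each line exceeds the usual 512-token limit of most tokenizers.

I would like to understand the process for building a text dataset that tokenizes each line after first splitting the documents into lines of "tokenizable" size, as the old TextDataset class did. With that class you only had to do the following, and a tokenized dataset with no text lost was ready to pass to a DataCollator: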

model_checkpoint = 'distilbert-base-uncased'

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

from transformers import TextDataset

dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="path/to/text_file.txt",
    block_size=512,
)

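I would like to use the datasets library instead of this soon-to-be-deprecated approach. For now, what I have is the following, which of course throws an error because each line is longer than the tokenizer's maximum block size: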

import datasets
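# A plain-text file is loaded with the "text" builder: one example per line, in a "text" column
dataset = datasets.load_dataset('text', data_files='path/to/text_file.txt')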

model_checkpoint = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def tokenize_function(examples):
    return tokenizer(examples["text"])

tokenized_datasets = dataset.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])

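So what is the "standard" way of creating a dataset the way it was done before, but with the datasets library?

Thank you very much for the help :))

【Question comments】: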

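It really depends on your task. Tokenizers offer several options, such as truncation and sliding windows. Just check the tokenizer's parameters.
Yes, but that is a bit like saying nothing. As I mentioned before, the tokenizer is already being used. The idea is to split the dataset into sequences that can then be tokenized, so that no information is lost in the process.
Not sure if I misunderstood you, but what I meant is that you are calling the tokenizer with the wrong options for your use case, which you have not specified (language modeling is very broad). What you probably want is a sliding-window approach, and you have to decide for yourself what happens to the overflowing tokens.
Yes, that is what I was looking for, although I was having trouble with the concrete implementation. I posted an answer below with details from the HuggingFace Datasets folks :)

Tags: python bert-language-model huggingface-transformers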

【Solution 1】:

I received an answer to this question from @lhoestq on the HuggingFace Datasets forum:

Hi!

If you want to tokenize line by line, you can use this:

max_seq_length = 512
num_proc = 4

def tokenize_function(examples):
    # Remove empty lines
    examples["text"] = [line for line in examples["text"] if len(line) > 0 and not line.isspace()]
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=max_seq_length,
    )

tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    num_proc=num_proc,
    remove_columns=["text"],
)
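Note that with truncation=True, everything past max_seq_length tokens in a line is discarded, which is exactly the text loss the question wants to avoid. Here is a minimal plain-Python sketch of the difference between truncating and splitting into windows, with lists of integers standing in for token ids. The helpers `truncate_ids` and `window_ids` are hypothetical names made up for this illustration, not library functions. Fast tokenizers expose a similar idea through the `return_overflowing_tokens` and `stride` arguments, but this sketch only approximates that behavior and ignores special tokens.

```python
# Toy illustration: lists of ints stand in for token ids.
# `truncate_ids` and `window_ids` are hypothetical helpers, not library functions.

def truncate_ids(ids, max_length):
    """Keep only the first max_length ids; the rest is lost."""
    return ids[:max_length]

def window_ids(ids, max_length, stride=0):
    """Split ids into windows of at most max_length ids.

    Consecutive windows overlap by `stride` ids, so nothing is dropped.
    """
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    step = max_length - stride
    windows = []
    for start in range(0, len(ids), step):
        windows.append(ids[start:start + max_length])
        # Stop once the current window reaches the end of the sequence.
        if start + max_length >= len(ids):
            break
    return windows

line = list(range(10))                   # a "document" of 10 token ids
truncated = truncate_ids(line, 4)        # only the first 4 ids survive
windows = window_ids(line, 4, stride=1)  # every id appears in some window
print(truncated)
print(windows)
```

Whether overlapping windows are useful depends on the task. For plain language-model pretraining, concatenating and chunking (as TextDataset did) is usually the simpler choice.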

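TextDataset, however, processed the data differently: it concatenated all the texts and built blocks of size 512. If you need that behavior, you have to apply an additional map function after tokenization: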

# Main data processing function that will concatenate all texts from
# our dataset and generate chunks of max_seq_length.
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it instead of this drop,
    # you can customize this part to your needs.
    total_length = (total_length // max_seq_length) * max_seq_length
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)]
        for k, t in concatenated_examples.items()
    }
    return result

# Note that with `batched=True`, this map processes 1,000 texts together,
# so group_texts throws away a remainder for each of those groups of 1,000 texts.
# You can adjust that batch_size here but a higher value might be slower to preprocess.

tokenized_dataset = tokenized_dataset.map(
    group_texts,
    batched=True,
    num_proc=num_proc,
)

This code comes from the data processing in the run_mlm.py example script of transformers.
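To see what the grouping step does without loading a tokenizer, here is a small sketch that runs the same concatenate-and-chunk logic on a hand-written batch. The block size of 4 and the toy token ids are assumptions chosen only to keep the output readable. It also shows why the batch size matters: a remainder is dropped once per batch, so smaller batches can lose more tokens.

```python
max_seq_length = 4  # tiny block size, for illustration only

def group_texts(examples):
    # Concatenate every column (input_ids, attention_mask, ...) into one long list.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the remainder that does not fill a whole block.
    total_length = (total_length // max_seq_length) * max_seq_length
    # Cut each column into blocks of max_seq_length.
    return {
        k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)]
        for k, t in concatenated.items()
    }

# A hand-written "tokenized" batch: three short lines, 9 tokens in total.
batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    "attention_mask": [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
}

# One batch of three lines: 9 tokens give two blocks of 4, and one token is dropped.
grouped = group_texts(batch)
print(grouped["input_ids"])

# The same lines split into two batches: a remainder is dropped in each batch.
first = group_texts({k: v[:2] for k, v in batch.items()})
second = group_texts({k: v[2:] for k, v in batch.items()})
print(first["input_ids"], second["input_ids"])
```

In the real pipeline the batch size is set with the `batch_size` argument of `map` (1,000 by default), so the share of dropped tokens is normally small.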

【Answer comments】: