Train spaCy for text classification

【Question title】: Train spaCy for text classification
【Posted】: 2021-11-23 12:30:26
【Question description】:

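After reading the docs and working through the tutorial, I figured I would build a small demo. It turns out my model does not want to train. Here is the code: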

import spacy
import random
import json

TRAINING_DATA = [
    ["My little kitty is so special", {"KAT": True}],
    ["Dude, Totally, Yeah, Video Games", {"KAT": False}],
    ["Should I pay $1,000 for the iPhone X?", {"KAT": False}],
    ["The iPhone 8 reviews are here", {"KAT": False}],
    ["Noa is a great cat name.", {"KAT": True}],
    ["We got a new kitten!", {"KAT": True}]
]

nlp = spacy.blank("en")
category = nlp.create_pipe("textcat")
nlp.add_pipe(category)
category.add_label("KAT")

# Start the training
nlp.begin_training()

# Loop for 100 iterations
for itn in range(100):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    losses = {}

    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
        texts = [text for text, entities in batch]
        annotations = [{"textcat": [entities]} for text, entities in batch]
        nlp.update(texts, annotations, losses=losses)
    if itn % 20 == 0:
        print(losses)

When I run it, the output suggests that the model is learning very little.

{'textcat': 0.0}
{'textcat': 0.0}
{'textcat': 0.0}
{'textcat': 0.0}
{'textcat': 0.0}

This feels wrong. There should be either an error or meaningful labels. The predictions confirm it.

for text, d in TRAINING_DATA:
    print(text, nlp(text).cats)

# Dude, Totally, Yeah, Video Games {'KAT': 0.45303162932395935}
# The iPhone 8 reviews are here {'KAT': 0.45303162932395935}
# Noa is a great cat name. {'KAT': 0.45303162932395935}
# Should I pay $1,000 for the iPhone X? {'KAT': 0.45303162932395935}
# We got a new kitten! {'KAT': 0.45303162932395935}
# My little kitty is so special {'KAT': 0.45303162932395935}

It feels like my code is missing something, but I cannot figure out what.

【Comments】:

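Here they used 2000 examples. Are you sure machine learning works with 6 examples? All three of your cat examples use a different word for the cat. I would start with 10 different examples that use a single word for a cat.
Sure, but the textcat component reports zero loss, and that should not happen.
Your training loop and data look correct. I think I found the problem: try changing {"textcat": [entities]} to {"cats": entities} (also see here for the keys expected in the annotations if you pass in a dict). When the text classifier is updated, it looks for the key "cats", but that key does not exist in your annotations, only "textcat". So you were effectively updating the text classifier with nothing, and ended up with just the randomly initialized weights (from nlp.begin_training).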
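
The fix described in the last comment can be sketched in plain Python, without spaCy itself. The helper name `build_annotations` is hypothetical and not part of spaCy. It only illustrates the shape of the annotation dicts, with the labels stored under the "cats" key instead of "textcat".

```python
# Hypothetical helper (not part of spaCy): build annotation dicts in the
# shape the text classifier reads, i.e. with the labels under "cats".
def build_annotations(batch):
    texts = [text for text, cats in batch]
    annotations = [{"cats": cats} for text, cats in batch]
    return texts, annotations


batch = [
    ["My little kitty is so special", {"KAT": True}],
    ["The iPhone 8 reviews are here", {"KAT": False}],
]

texts, annotations = build_annotations(batch)
print(annotations)
```

In the question's training loop, `texts` and `annotations` built this way would be passed to `nlp.update(texts, annotations, losses=losses)` unchanged.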

Tags: python spacy

【Solution 1】:

If you upgrade to spaCy 3, the code above will no longer work. The solution is to make a few changes to migrate. I have adapted the example from cantdutchthis accordingly.

Summary of changes:

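The architecture is now selected through a config. The old default was "bag of words"; the new default is a "text ensemble" that uses attention. Keep this in mind when tuning the model
Labels now need to be one-hot encoded
The add_pipe interface has changed slightly
nlp.update now requires Example objects instead of tuples of text and annotation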
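
The one-hot requirement in the list above can be illustrated with a small sketch that does not need spaCy. The helper `one_hot_cats` and the `LABELS` list are assumptions made for illustration. The idea is that each example carries a score for every label, with exactly one label set to 1.0.

```python
LABELS = ["KAT0", "KAT1"]  # assumed label set, mirroring the labels in this answer


def one_hot_cats(gold_label, labels=LABELS):
    """Return a cats dict with 1.0 for the gold label and 0.0 for all others."""
    if gold_label not in labels:
        raise ValueError(f"unknown label: {gold_label!r}")
    return {label: 1.0 if label == gold_label else 0.0 for label in labels}


cats = one_hot_cats("KAT0")
print(cats)  # {'KAT0': 1.0, 'KAT1': 0.0}
```

A dict produced this way would go under the "cats" key of the annotation that is handed to `Example.from_dict`.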

import spacy
# Add imports for example, as well as textcat config...
from spacy.training import Example
from spacy.pipeline.textcat import single_label_bow_config, single_label_default_config
from thinc.api import Config
import random

# labels should be one-hot encoded
TRAINING_DATA = [
    ["My little kitty is so special", {"KAT0": True}],
    ["Dude, Totally, Yeah, Video Games", {"KAT1": True}],
    ["Should I pay $1,000 for the iPhone X?", {"KAT1": True}],
    ["The iPhone 8 reviews are here", {"KAT1": True}],
    ["Noa is a great cat name.", {"KAT0": True}],
    ["We got a new kitten!", {"KAT0": True}]
]


# bow
# config = Config().from_str(single_label_bow_config)

# textensemble with attention
config = Config().from_str(single_label_default_config)

nlp = spacy.blank("en")
# now uses `add_pipe` instead
category = nlp.add_pipe("textcat", last=True, config=config)
category.add_label("KAT0")
category.add_label("KAT1")


# Start the training
nlp.begin_training()

# Loop for 100 iterations
for itn in range(100):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    losses = {}

    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=4):
        texts = [nlp.make_doc(text) for text, entities in batch]
        annotations = [{"cats": entities} for text, entities in batch]

        # uses an example object rather than text/annotation tuple
        examples = [Example.from_dict(doc, annotation) for doc, annotation in zip(
            texts, annotations
        )]
        nlp.update(examples, losses=losses)
    if itn % 20 == 0:
        print(losses)

【Comments】:

It looks like the config variable is initialized but not used anywhere? How does the textcat model pick up the config?

【Solution 2】:

Following Ines's comment, this is the answer.

import spacy
import random
import json

TRAINING_DATA = [
    ["My little kitty is so special", {"KAT": True}],
    ["Dude, Totally, Yeah, Video Games", {"KAT": False}],
    ["Should I pay $1,000 for the iPhone X?", {"KAT": False}],
    ["The iPhone 8 reviews are here", {"KAT": False}],
    ["Noa is a great cat name.", {"KAT": True}],
    ["We got a new kitten!", {"KAT": True}]
]

nlp = spacy.blank("en")
category = nlp.create_pipe("textcat")
category.add_label("KAT")
nlp.add_pipe(category)

# Start the training
nlp.begin_training()

# Loop for 100 iterations
for itn in range(100):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    losses = {}

    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=1):
        texts = [nlp(text) for text, entities in batch]
        annotations = [{"cats": entities} for text, entities in batch]
        nlp.update(texts, annotations, losses=losses)
    if itn % 20 == 0:
        print(losses)

【Comments】: