Using spaCy 3.0 to convert data from the old spaCy v2 format to the brand new spaCy v3 format: answers

【Question title】: Using spaCy 3.0 to convert data from old Spacy v2 format to the brand new Spacy v3 format
【Posted】: 2021-05-05 19:13:09
【Question description】:

I have a variable trainData, which has the following simplified format.

[

('Paragraph_A', {"entities": [(15, 26, 'DiseaseClass'), (443, 449, 'DiseaseClass'), (483, 496, 'DiseaseClass')]}),
('Paragraph_B', {"entities": [(969, 975, 'DiseaseClass'), (1257, 1271, 'SpecificDisease')]}),
('Paragraph_C', {"entities": [(0, 27, 'SpecificDisease')]})
]

I am trying to convert trainData to .spacy by first converting each item to a Doc and then adding the Docs to a DocBin. The full trainData file is accessible via GoogleDocs.

I tried to reproduce what is described in this tutorial, but it did not work for me. The tutorial is: Using spaCy 3.0 to build a custom NER model

I tried the following.

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en") # load a new spacy model
db = DocBin() # create a DocBin object

for text, annot in trainData: # data in previous format
    doc = nlp.make_doc(text) # create doc object from text
    ents = []
    for start, end, label in annot["entities"]: # add character indexes
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        ents.append(span)
    doc.ents = span # label the text with the ents
    db.add(doc)

db.to_disk("./train.spacy") # save the docbin object

But I am doing something wrong in the code that converts the data from spaCy v2 to spaCy v3. With the snippet above I get a traceback: TypeError: 'spacy.tokens.token.Token' object is not iterable.

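【Comments】:

Not sure if this is your only problem, but doc.ents = span should be doc.ents = ents.
After investigating, I think that is your only problem, assuming your annotations are fine.
May I ask what the problems might be?

Tags: python nlp spacy data-conversion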

【Solution 1】:

You have a small mistake. Look for the line marked XXX to see the change.

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en") # load a new spacy model
db = DocBin() # create a DocBin object

for text, annot in trainData: # data in previous format
    doc = nlp.make_doc(text) # create doc object from text
    ents = []
    for start, end, label in annot["entities"]: # add character indexes
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        ents.append(span)
    #XXX FOLLOWING LINE CHANGED
    doc.ents = ents # label the text with the ents
    db.add(doc)

db.to_disk("./train.spacy") # save the docbin object

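【Comments】:

Thanks a lot, but I just tested the code and got this traceback: ValueError: [E1010] Unable to set entity information for token 27 which is included in more than one span in entities, blocked, missing or outside.
Yes, that is a problem with your entity annotations. It is like saying that the bracketed part of I li[ke che]ese is a person. If you need help, please ask a question with sample data.
Ah, actually, it looks like you have two annotations on the same token or something...? Either way, this is an annotation issue and cannot be fixed without looking at the annotations.
I have provided the annotations here.
I changed to alignment_mode="strict" and kept the rest of your code the same. I got the traceback: TypeError: object of type 'NoneType' has no len() in doc.ents = ents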
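
The TypeError in the last comment follows from how Doc.char_span behaves: with alignment_mode="strict" it returns None when the character offsets do not fall on token boundaries, and a list that contains None cannot be assigned to doc.ents. A minimal sketch of filtering such spans, using the toy sentence from the comment above; the offsets and labels are chosen for illustration and are not taken from the asker's data:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp.make_doc("I like cheese")

# Illustrative annotations (assumed, not from the original data):
# (4, 10) cuts through "like" and "cheese", (7, 13) covers "cheese" exactly.
annotations = [(4, 10, "PERSON"), (7, 13, "FOOD")]

spans = [
    doc.char_span(start, end, label=label, alignment_mode="strict")
    for start, end, label in annotations
]

# Misaligned offsets yield None in strict mode, so drop them before assigning.
valid_spans = [span for span in spans if span is not None]
doc.ents = valid_spans
```

Logging the skipped offsets rather than dropping them silently makes it easier to track down the faulty annotations later.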

【Solution 2】:

I found the problem in the entities of the following abstract:

[Machado-Joseph disease, Machado-Joseph disease, MJD, MJD, MJD, MJD, Huntington disease, HD, HD, MJD, Machado-Joseph disease, Machado-Joseph disease, MJD, MJD, MJD, MJD, Huntington disease, HD, HD, MJD]

The abstract in question:

8528200|t|Evidence for inter-generational instability in the CAG repeat in the MJD1 gene and for conserved haplotypes at flanking markers amongst Japanese and Caucasian subjects with Machado-Joseph disease.
8528200|a|The size of the (CAG)n repeat array in the 3' end of the MJD1 gene and the haplotype at a series of microsatellite markers surrounding the MJD1 gene were examined in a large cohort of Japanese and Caucasian subjects affected with Machado-Joseph disease (MJD). Our data provide five novel observations. First, MJD is associated with expansion fo the array from the normal range of 14-37 repeats to 68-84 repeats in most Japanese and Caucasian subjects, but no subjects were observed with expansions intermediate in size between those of the normal and MJD affected groups. Second, the expanded allele associated with MJD displays inter-generational instability, particularly in male meioses, and this instability was associated with the clinical phenomenon of anticipation. Third, the size of the expanded allele is not only inversely correlated with the age-of-onset of MJD (r = -0.738, p < 0.001), but is also correlated with the frequency of other clinical features [e.g. pseudoexophthalmos and pyramidal signs were more frequent in subjects with large repeats (p < 0.001 and p < 0.05 respectively)]. Fourth, the disease phenotype is significantly more severe and had an early age of onset (16 years) in a subject homozygous for the expanded allele, which contrasts with Huntington disease and suggests that the expanded allele in the MJD1 gene could exert its effect either by a dominant negative effect (putatively excluded in HD) or by a gain of function effect as proposed for HD. Finally, Japanese and Caucasian subjects affected with MJD share haplotypes at several markers surrounding the MJD1 gene, which are uncommon in the normal Japanese and Caucasian population, and which suggests the existence either of common founders in these populations or of chromosomes susceptible to pathologic expansion of the CAG repeat in the MJD1 gene.
8528200 173 195 Machado-Joseph disease  SpecificDisease D017827
8528200 427 449 Machado-Joseph disease  SpecificDisease D017827
8528200 451 454 MJD SpecificDisease D017827
8528200 506 509 MJD SpecificDisease D017827
8528200 748 751 MJD Modifier    D017827
8528200 813 816 MJD SpecificDisease D017827
8528200 1067    1070    MJD SpecificDisease D017827
8528200 1470    1488    Huntington disease  SpecificDisease D006816
8528200 1628    1630    HD  SpecificDisease D006816
8528200 1680    1682    HD  SpecificDisease D006816
8528200 1739    1742    MJD SpecificDisease D017827

Here t stands for the title and a for the abstract. The two need to be concatenated.
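
A minimal sketch of that concatenation, assuming the records use the tab-separated PubTator layout shown above and that the offsets are counted over the title, one separator character, and the abstract. The function name parse_pubtator, its sep parameter, and the sample record are made up for illustration; the check text[start:end] == mention is there to catch a wrong separator assumption early.

```python
def parse_pubtator(block, sep=" "):
    """Turn one PubTator record into (text, {"entities": [...]}).

    Assumption: offsets refer to title + sep + abstract, with sep being
    a single character.
    """
    title, abstract, raw_entities = "", "", []
    for line in block.strip().splitlines():
        parts = line.split("|", 2)
        if len(parts) == 3 and parts[1] in ("t", "a"):
            if parts[1] == "t":
                title = parts[2]
            else:
                abstract = parts[2]
            continue
        fields = line.split("\t")
        if len(fields) >= 5:
            # fields: id, start, end, mention, label, (concept id)
            raw_entities.append((int(fields[1]), int(fields[2]), fields[3], fields[4]))

    text = title + sep + abstract
    entities = []
    for start, end, mention, label in raw_entities:
        # Verify the offsets against the concatenated text.
        if text[start:end] != mention:
            raise ValueError(f"Offset mismatch for {mention!r} at {start}-{end}")
        entities.append((start, end, label))
    return text, {"entities": entities}


# Hypothetical record, not taken from the corpus.
sample = (
    "1|t|Flu in adults.\n"
    "1|a|Influenza is common.\n"
    "1\t0\t3\tFlu\tSpecificDisease\tX1\n"
    "1\t15\t24\tInfluenza\tSpecificDisease\tX1"
)
text, annotations = parse_pubtator(sample)
```

The returned tuple has the same shape as the items in trainData, so it can be passed to the conversion loop unchanged.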


import spacy
from spacy.tokens import DocBin
from tqdm import tqdm


def converter(data, outputFile):
    """
    Converts data to the new Spacy v3 format; .spacy binary format
    Inputs: 
        data: data should be in the format: (abstract, {'entities' : [(start, end, label), (start, end, label)]})
        outputFile: file name output
    Outputs:
        {outputFile}.spacy format file
    """
    nlp = spacy.blank("en") # load a new spacy model
    doc_bin = DocBin() # create a DocBin object

    for text, annot in tqdm(data): # data in previous format
        doc = nlp.make_doc(text) # create doc object from text    
        ents = []
        
        for start, end, label in annot["entities"]: # add character indexes
            # supported modes: strict, contract, expand
            span = doc.char_span(start, end, label=label, alignment_mode="strict")
            if span is None:
                # the offsets do not align with token boundaries; skip this entity
                pass
            else:
                ents.append(span)
        try:
            doc.ents = ents # label the text with the ents
        except ValueError:
            # spaCy raises E1010 when a token belongs to more than one span;
            # the entities of that document are skipped. So far this only
            # affects the following abstract entities:
            # [Machado-Joseph disease, Machado-Joseph disease, MJD, MJD, MJD, MJD, Huntington disease, 
            # HD, HD, MJD, Machado-Joseph disease, Machado-Joseph disease, MJD, MJD, MJD, MJD, 
            # Huntington disease, HD, HD, MJD]
            pass
        doc_bin.add(doc)
        
    doc_bin.to_disk(f"./{outputFile}.spacy") # save the docbin object
    return f"Processed {len(doc_bin)}"

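The converter() function works fine, but it ignores the entities listed above. I still do not know how to handle this case so that spaCy keeps those entities instead of treating them as duplicates and dropping them.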
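
One thing that may be worth checking: the second half of the entity list printed above repeats the first half, which suggests the annotations for this abstract ended up in the input twice, and spaCy rejects a token that belongs to more than one entity span. A minimal sketch, under that assumption, of cleaning the (start, end, label) tuples before calling doc.char_span. dedupe_entities is a hypothetical helper, not part of spaCy, and keeping the longest span on overlap is a design choice that mirrors what spacy.util.filter_spans does for Span objects.

```python
def dedupe_entities(entities):
    """Drop exact duplicates and overlapping spans, preferring longer spans."""
    unique = set(entities)
    # Longest span first; ties are broken by start offset, then label,
    # so the result is deterministic.
    ordered = sorted(unique, key=lambda e: (-(e[1] - e[0]), e[0], e[2]))
    kept = []
    for start, end, label in ordered:
        # Keep the span only if it does not overlap anything kept so far.
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


# Illustrative input: two exact duplicates plus one made-up overlapping span.
entities = [
    (173, 195, "SpecificDisease"),
    (451, 454, "SpecificDisease"),
    (173, 195, "SpecificDisease"),
    (451, 454, "SpecificDisease"),
    (180, 195, "DiseaseClass"),
]
cleaned = dedupe_entities(entities)
```

The cleaned list can then be iterated in the inner loop of converter() in place of annot["entities"], so the affected abstract keeps one copy of each entity instead of losing all of them.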
