[Question title]: Spacy BILOU format to spacy json format
[Posted]: 2020-11-04 07:21:14
[Question]:

I am trying to upgrade my spaCy installation to the nightly version, mainly to use spaCy transformers.

So I converted a dataset in spaCy's simple training format, like

```python
td = [["Who is Shaka Khan?", {"entities": [(7, 17, "FRIENDS")]}],["I like London.", {"entities": [(7, 13, "LOC")]}],]
```

into

```json
[[{"head": 0, "dep": "", "tag": "", "orth": "Who", "ner": "O", "id": 0}, {"head": 0, "dep": "", "tag": "", "orth": "is", "ner": "O", "id": 1}, {"head": 0, "dep": "", "tag": "", "orth": "Shaka", "ner": "B-FRIENDS", "id": 2}, {"head": 0, "dep": "", "tag": "", "orth": "Khan", "ner": "L-FRIENDS", "id": 3}, {"head": 0, "dep": "", "tag": "", "orth": "?", "ner": "O", "id": 4}], [{"head": 0, "dep": "", "tag": "", "orth": "I", "ner": "O", "id": 0}, {"head": 0, "dep": "", "tag": "", "orth": "like", "ner": "O", "id": 1}, {"head": 0, "dep": "", "tag": "", "orth": "London", "ner": "U-LOC", "id": 2}, {"head": 0, "dep": "", "tag": "", "orth": ".", "ner": "O", "id": 3}]]
```

using the following script:

```python
import json
import spacy
from spacy.training import offsets_to_biluo_tags

# Blank English pipeline just for tokenization
nlp = spacy.blank("en")

sentences = []
for t in td:
    doc = nlp(t[0])
    # Align the character-offset entities to the tokens as BILOU tags
    tags = offsets_to_biluo_tags(doc, t[1]["entities"])
    tokens = []
    for n, (tok, tag) in enumerate(zip(doc, tags)):
        tokens.append({
            "head": 0,
            "dep": "",
            "tag": "",
            "orth": tok.orth_,
            "ner": tag,
            "id": n,
        })
    sentences.append(tokens)

with open("train_data.json", "w") as js:
    json.dump(sentences, js)
```
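For reference, the tagging step the script relies on can be checked on its own; a minimal sketch using spaCy v3's `spacy.training.offsets_to_biluo_tags` on the first training example:

```python
import spacy
from spacy.training import offsets_to_biluo_tags

# Tokenize the first training example with a blank English pipeline
nlp = spacy.blank("en")
doc = nlp("Who is Shaka Khan?")

# Map the character-offset annotation (7, 17, "FRIENDS") onto the tokens
tags = offsets_to_biluo_tags(doc, [(7, 17, "FRIENDS")])
print(list(zip([t.text for t in doc], tags)))
# → [('Who', 'O'), ('is', 'O'), ('Shaka', 'B-FRIENDS'), ('Khan', 'L-FRIENDS'), ('?', 'O')]
```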


Then I tried to convert this train_data.json using spaCy's `convert` command:

```python -m spacy convert train_data.json converted/```


but the result in the converted folder is

```✔ Generated output file (0 documents): converted/train_data.spacy``` 

which means it didn't create any dataset.

Can anybody help with what I am missing?

I am trying to do this with spacy-nightly.

[Question comments]:

    Tags: python spacy spacy-transformers


    [Solution 1]:

    You can skip the intermediate JSON step and convert the annotations directly to a DocBin:

    import spacy
    from spacy.training import Example
    from spacy.tokens import DocBin
    
    td = [["Who is Shaka Khan?", {"entities": [(7, 17, "FRIENDS")]}],["I like London.", {"entities": [(7, 13, "LOC")]}],]
    
    nlp = spacy.blank("en")
    db = DocBin()
    
    for text, annotations in td:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        db.add(example.reference)
    
    db.to_disk("td.spacy")
    

    See: https://nightly.spacy.io/usage/v3#migrating-training-python

    (If you really do want to use the intermediate JSON format, refer to this spec: https://spacy.io/api/annotation#json-input. You can include `orth` and `ner` in `tokens` and omit the other features, but you need the surrounding structure with `paragraphs`, `raw`, and `sentences`. There is an example here: https://github.com/explosion/spaCy/blob/45c9a688285081cd69faa0627d9bcaf1f5e799a1/examples/training/training-data.json)
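To illustrate, a minimal hand-written sketch of that wrapped JSON structure for the first training example. Field names follow the spec linked above; `head`, `dep`, and `tag` are left out here for brevity, as the spec allows:

```python
import json

# spaCy v2-style training JSON: a list of documents,
# each with paragraphs -> sentences -> tokens
train_json = [
    {
        "id": 0,
        "paragraphs": [
            {
                "raw": "Who is Shaka Khan?",
                "sentences": [
                    {
                        "tokens": [
                            {"id": 0, "orth": "Who", "ner": "O"},
                            {"id": 1, "orth": "is", "ner": "O"},
                            {"id": 2, "orth": "Shaka", "ner": "B-FRIENDS"},
                            {"id": 3, "orth": "Khan", "ner": "L-FRIENDS"},
                            {"id": 4, "orth": "?", "ner": "O"},
                        ]
                    }
                ],
            }
        ],
    }
]

with open("train_data.json", "w") as js:
    json.dump(train_json, js)
```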

    [Comments]:

    • @abb Thanks for the reply, but I have a doubt: when I `print(example.reference)` I only see "Who is Shaka Khan?" and "I like London." — no entities, which is what I expected to see. Is something wrong?
    • `example.reference` is a `Doc`, so `print(doc)` only shows `doc.text`. Look at `doc.ents` to see the entities.
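Following up on that comment, a minimal sketch (same toy data as in the answer) that round-trips the DocBin and inspects `doc.ents`; bytes stand in for `to_disk`/`from_disk` just to keep it self-contained:

```python
import spacy
from spacy.tokens import DocBin
from spacy.training import Example

td = [["Who is Shaka Khan?", {"entities": [(7, 17, "FRIENDS")]}],
      ["I like London.", {"entities": [(7, 13, "LOC")]}]]

nlp = spacy.blank("en")
db = DocBin()
for text, annotations in td:
    example = Example.from_dict(nlp.make_doc(text), annotations)
    db.add(example.reference)

# Round-trip through bytes and check the stored entities
docs = list(DocBin().from_bytes(db.to_bytes()).get_docs(nlp.vocab))
for doc in docs:
    print(doc.text, [(ent.text, ent.label_) for ent in doc.ents])
# Who is Shaka Khan? [('Shaka Khan', 'FRIENDS')]
# I like London. [('London', 'LOC')]
```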