【Title】: Trouble extracting compound nouns that include hyphens in NLP
【Posted】: 2020-10-15 05:39:30
【Description】:

Background and goal

I want to extract nouns and compound nouns, including hyphenated ones, from each sentence, as shown below. If a compound contains a hyphen, I need to extract it with the hyphen.

{The T-shirt is old.: ['T-shirt'], 
I bought the computer and the new web-cam.: ['computer', 'web-cam'], 
I bought the computer and the new web camera.: ['computer', 'web camera']}

Problem

The current output is shown below. The first word of each compound noun is tagged 'compound', but I cannot extract the results I expect.

T T PROPN NNP compound X True False
shirt shirt NOUN NN nsubj xxxx True False
computer computer NOUN NN dobj xxxx True False
web web NOUN NN compound xxx True False
cam cam NOUN NN conj xxx True False
computer computer NOUN NN dobj xxxx True False
web web NOUN NN compound xxx True False
camera camera NOUN NN conj xxxx True False

{The T-shirt is old.: ['T -', 'T', 'T -', 'shirt'], 
I bought the computer and the new web-cam.: ['web -', 'computer', 'web -', 'web', 'web -', 'cam'], 
I bought the computer and the new web camera.: ['web camera', 'computer', 'web camera', 'web', 'web camera', 'camera']}
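For context, spaCy's English tokenizer splits a hyphenated word into three tokens (the two word parts plus the hyphen itself), which is why fragments like 'T -' appear in the output above: `text[j+1]` after 'T' is the '-' token, not 'shirt'. A minimal sketch to confirm this tokenizer behavior, using the rule-based blank English pipeline (no statistical model required):

```python
import spacy

# spacy.blank("en") loads only the rule-based English tokenizer,
# which is enough to inspect how hyphenated words are split.
nlp = spacy.blank("en")

doc = nlp("The T-shirt is old.")
print([token.text for token in doc])
# 'T-shirt' is split into three tokens: 'T', '-', 'shirt'
```

Any approach that concatenates a compound token with the single next token will therefore pick up the hyphen rather than the head noun.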

Current code

I am using the NLP library spaCy to distinguish nouns and compound nouns. I would appreciate suggestions on how to fix the current code.

import spacy
nlp = spacy.load("en_core_web_sm")

texts =  ["The T-shirt is old.", "I bought the computer and the new web-cam.", "I bought the computer and the new web camera."]

nouns = []*len(texts)
dic = {k: v for k, v in zip(texts, nouns)}

for i in range(len(texts)):
    text = nlp(texts[i])
    words = []
    for word in text:
        if word.pos_ == 'NOUN' or word.pos_ == 'PROPN':
            print(word.text, word.lemma_, word.pos_, word.tag_, word.dep_,
                word.shape_, word.is_alpha, word.is_stop)

            #compound words
            for j in range(len(text)):
                    token = text[j]
                    if token.dep_ == 'compound':
                        if j < len(text)-1:
                            nexttoken = text[j+1]
                            words.append(str(token.text + ' ' + nexttoken.text))


            else:
                words.append(word.text)
    dic[text] = words       
print(dic)

Development environment

Python 3.7.4

spaCy version 2.3.2

【Comments】:

    Tags: python python-3.x string nlp spacy


    【Solution 1】:

    Please try:

    import spacy
    nlp = spacy.load("en_core_web_sm")
    
    texts =  ("The T-shirt is old.",
              "I bought the computer and the new web-cam.",
              "I bought the computer and the new web camera.",
             )
    docs = nlp.pipe(texts)  
    
    compounds = []
    for doc in docs:
        compounds.append({doc.text:[doc[tok.i:tok.head.i+1] for tok in doc if tok.dep_=="compound"]})
    print(compounds)
    [{'The T-shirt is old.': [T-shirt]}, 
    {'I bought the computer and the new web-cam.': [web-cam]}, 
    {'I bought the computer and the new web camera.': [web camera]}]
    

    computer is missing from this list, but I don't think it qualifies as a compound.
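    If you only need the hyphenated compounds (and not parser-detected ones like 'web camera'), a string-level regular expression is a lightweight fallback. This is just a sketch: the helper name is made up here, and since it knows nothing about part-of-speech, it would also match hyphenated non-nouns.

```python
import re

# Hypothetical helper: extract hyphenated compounds such as
# 'T-shirt' or 'web-cam' at the string level. It cannot find
# space-separated compounds like 'web camera', and it does not
# filter by part-of-speech.
def hyphenated_compounds(text):
    return re.findall(r"\b\w+(?:-\w+)+\b", text)

print(hyphenated_compounds("I bought the computer and the new web-cam."))
# ['web-cam']
```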

    【Discussion】:
