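【Posted】: 2019-09-19 03:00:38
【Question】:
I am trying to do data augmentation on an FAQ dataset. I replace words, specifically nouns, with their most similar synonym: WordNet supplies the candidate synonyms and spaCy scores their similarity. I use several nested for loops to go through my dataset.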
import spacy
import nltk
from nltk.corpus import wordnet as wn
import pandas as pd

nlp = spacy.load('en_core_web_md')
nltk.download('wordnet')
questions = pd.read_csv("FAQ.csv")
list_questions = []
for question in questions.values:
    list_questions.append(nlp(question[0]))
for question in list_questions:
    for token in question:
        treshold = 0.5
        if token.pos_ == 'NOUN':
            wordnet_syn = wn.synsets(str(token), pos=wn.NOUN)
            for syn in wordnet_syn:
                for lemma in syn.lemmas():
                    similar_word = nlp(lemma.name())
                    if similar_word.similarity(token) != 1. and similar_word.similarity(token) > treshold:
                        good_word = similar_word
                        treshold = token.similarity(similar_word)
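However, the following warning is printed several times and I don't understand why:
UserWarning: [W008] Evaluating Doc.similarity based on empty vectors.
It is my similar_word.similarity(token) call that triggers it, but I don't understand why.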
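One possible contributor, offered as an assumption rather than a confirmed diagnosis: WordNet lemma names join multi-word expressions with underscores (for example ice_cream), and a string like that is likely to reach spaCy as a single out-of-vocabulary token with no vector. The sketch below shows one way to clean the lemma names before passing them to nlp(). The helper name candidate_texts is hypothetical and is not part of NLTK or spaCy.

```python
def candidate_texts(lemma_names, original):
    """Turn WordNet lemma names into plain-text replacement candidates.

    Hypothetical helper (not an NLTK or spaCy function):
    - underscores become spaces, so 'ice_cream' becomes 'ice cream'
    - candidates equal to the original word (case-insensitive) are dropped
    - duplicates are dropped, keeping first-seen order
    """
    original_key = original.lower()
    seen = set()
    candidates = []
    for name in lemma_names:
        text = name.replace("_", " ")
        key = text.lower()
        if key == original_key or key in seen:
            continue
        seen.add(key)
        candidates.append(text)
    return candidates


# Intended use inside the loop from the question (assumes the same names):
#   names = [lemma.name() for syn in wordnet_syn for lemma in syn.lemmas()]
#   for text in candidate_texts(names, token.text):
#       similar_word = nlp(text)
```

Dropping the original word up front would also make the != 1. comparison in the loop unnecessary for exact matches, though near-identical vectors could still score very close to 1.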
My list_questions has the following form:
list_questions = [Do you have a paper or other written explanation to introduce your model's details?, Where is the BERT code come from?, How large is a sentence vector?]
I need to check the token as well as similar_word inside the loop. For example, I still get the warning here:
tokens = nlp(u'dog cat unknownword')
similar_word = nlp(u'rabbit')
if similar_word:
    for token in tokens:
        if token:
            print(token.text, similar_word.similarity(token))
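A note on the check above: if token tests whether the token is non-empty, not whether it has a word vector, so it does not filter out unknownword. A minimal sketch of a guard based on vector_norm follows, on the assumption that the warning is raised whenever either side has a zero-norm vector. The helper name safe_similarity is hypothetical; vector_norm and similarity are existing attributes of spaCy Doc, Span and Token objects.

```python
def safe_similarity(a, b):
    """Return a.similarity(b), or None when either side has an empty vector.

    Hypothetical helper. `a` and `b` are expected to be spaCy Doc, Span or
    Token objects (anything exposing `vector_norm` and `similarity`).
    """
    for obj in (a, b):
        # vector_norm is the L2 norm of the vector; it is 0 when the vector
        # is all zeros, which is the case for words without a vector.
        if obj.vector_norm == 0:
            return None
    return a.similarity(b)


# Intended use with the snippet above (assumes en_core_web_md is loaded as nlp):
#   for token in tokens:
#       score = safe_similarity(similar_word, token)
#       if score is not None:
#           print(token.text, score)
```

Returning None rather than 0.0 keeps "no vector available" distinguishable from "genuinely dissimilar", which matters when the score is compared against a threshold.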
【Comments】:
Tags: python-3.x nlp pytorch spacy wordnet