How can I find and print unmatched/dissimilar words from the documents? (Answers)

【Question title】: How can I find and print unmatched/dissimilar words from the documents?
【Posted】: 2019-02-11 09:05:05
【Question description】:

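I'm trying to rewrite an algorithm that takes an input text file, compares it against a set of documents, and reports how similar it is to each one.

Now I also want to print the unmatched words and write them to a new text file.

In the code below, "hello force" is the input. It is checked against raw_documents, and the output is a score between 0 and 1 for each document. The word "force" matches the second document, so that document gets a higher score, but "hello" does not appear in any raw_document. I want to print the input words that match none of the raw_documents, which in this case is "hello".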

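import gensim
import nltk

# word_tokenize needs the NLTK tokenizer data (e.g. 'punkt') to be installed.
from nltk.tokenize import word_tokenize

# Reference documents that the query is compared against.
raw_documents = ["I'm taking the show on the road",
                 "My socks are a force multiplier.",
                 "I am the barber who cuts everyone's hair who doesn't cut their own.",
                 "Legend has it that the mind is a mad monkey.",
                 "I make my own fun."]

# Lowercase and tokenize every document.
gen_docs = [[w.lower() for w in word_tokenize(text)]
            for text in raw_documents]

# Map each distinct token to an integer id.
dictionary = gensim.corpora.Dictionary(gen_docs)

# Convert each document to a bag of words: a list of (token_id, count) pairs.
corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]

tf_idf = gensim.models.TfidfModel(corpus)

# Total number of (token_id, count) entries in the corpus (not used below).
s = 0
for i in corpus:
    s += len(i)

# Similarity index over the TF-IDF vectors, stored under the given path prefix.
sims = gensim.similarities.Similarity('/usr/workdir/', tf_idf[corpus],
                                      num_features=len(dictionary))

# Tokens missing from the dictionary are silently dropped by doc2bow here.
query_doc = [w.lower() for w in word_tokenize("hello force")]
query_doc_bow = dictionary.doc2bow(query_doc)
query_doc_tf_idf = tf_idf[query_doc_bow]

# One similarity score per document in raw_documents.
result = sims[query_doc_tf_idf]
print(result)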

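【Comments】:

It's not clear what you're trying to do, or why. You may want to edit your question to clarify: are your documents real natural-language text? Do you want your "word matching" to help determine whether one text is similar to other known texts?

Tags: python scikit-learn nltk gensim

【Solution 1】: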

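If you just want to know that the word "hello" isn't in any of the other documents, you may not even need a natural-language helper library like gensim. You can simply keep track of every word you have seen; a plain Python dict, set, or Counter is enough for that. (Once all the words are loaded, check each word of the new text in turn.)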
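
The word-tracking approach described above can be sketched as follows. This is a minimal illustration, not the asker's code: the `tokenize` and `find_unmatched` helpers are hypothetical names, and a simple regular expression stands in for nltk's `word_tokenize` so the example has no third-party dependencies.

```python
import re

raw_documents = [
    "I'm taking the show on the road",
    "My socks are a force multiplier.",
    "I am the barber who cuts everyone's hair who doesn't cut their own.",
    "Legend has it that the mind is a mad monkey.",
    "I make my own fun.",
]


def tokenize(text):
    # Assumption: a crude regex tokenizer replaces nltk's word_tokenize.
    return re.findall(r"[a-z0-9']+", text.lower())


def find_unmatched(query, documents):
    # Collect every word that occurs in any of the documents.
    known_words = set()
    for doc in documents:
        known_words.update(tokenize(doc))
    # Keep the query words, in order, that occur in none of the documents.
    return [word for word in tokenize(query) if word not in known_words]


unmatched = find_unmatched("hello force", raw_documents)
print(unmatched)  # ['hello']
```

To produce the text file the question asks for, the returned list could be written out with something like `open(path, "w").write("\n".join(unmatched))`, where `path` is whatever output file name you choose.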

Gensim's TfidfModel and Similarity are really meant for subtler, relative degrees of similarity, not for a yes/no answer on whether a word is present.

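Also, the gensim.corpora.Dictionary.doc2bow() method normally ignores words that are unknown to the dictionary rather than including them in the returned data, on the reasoning that they have no assigned slot and are therefore probably rare or unimportant. So the "bag of words" representation it returns by default, essentially a list of (known_word_index, count) pairs, cannot help with simply detecting unknown words.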

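However, you can look at its optional return_missing parameter and request return_missing=True. It then returns a (bag_of_words, dict_of_missing_words) tuple, and by looking at the second return value you can see which words are not already in the gensim.corpora.Dictionary object. See: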

https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.doc2bow
