Given cluster of documents, compute similarity between corpus and the cluster

【Question Title】: Given cluster of documents, compute similarity between corpus and the cluster
【Posted】: 2018-11-27 19:24:37
【Question】:

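I am working on similarity ranking by computing the distance between each document in a corpus and a cluster. The cluster is also given as a list of documents. The trouble is that I cannot come up with a proper way to compute the cluster's centroid so that I can compute the similarity. I tried using the mean of the cluster's tfidf matrix, but the results were poor.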

For example, my cluster is:

['Line a baking pan with a sheet of parchment paper.',
 'Line the cake pan with parchment paper.',
 'Line the bottom with parchment paper.',
 'Line a baking pan with parchment paper.'
]

My corpus contains the following 3 documents:

['Add vinegar and sugar.',
 'Remove pan from heat and let stand 5 minutes.',
 'Line the pan with parchment paper.'
]

I want to compute the similarity between each document and the cluster, which might produce something like:

[0.1, 0.1, 0.8]

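Do you have any suggestions? I tried representing both the cluster and the corpus documents as tfidf matrices, but computing the similarity between the two matrices does not seem to give the desired result. I also tried LSI, but it is the corpus I want to rank, not the cluster documents, which forces me to find a centroid that represents the cluster.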

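【Comments】:

See the updated answer.
Wow, thanks a lot. I didn't realize cosine similarity could be computed this way.
You're welcome! Happy days: a few lines of code now do what used to take a while.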

Tags: python pandas numpy nltk tf-idf

【Solution 1】:

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

cluster = ['Line a baking pan with a sheet of parchment paper.',
            'Line the cake pan with parchment paper.',
            'Line the bottom with parchment paper.',
            'Line a baking pan with parchment paper.']

corpus = ['Add vinegar and sugar.',
          'Remove pan from heat and let stand 5 minutes.',
          'Line the pan with parchment paper.']

# Train tfidf on cluster
tfidf = TfidfVectorizer()
tfidf_cluster = tfidf.fit_transform(cluster)

# Transform the corpus using the fitted tfidf
tfidf_corpus = tfidf.transform(corpus)

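# Cosine similarity: TfidfVectorizer L2-normalizes each row by default,
# so the dot product of two rows is their cosine similarity.
# (`@` and `.toarray()` replace np.dot and the `.A` shorthand, which newer SciPy releases no longer provide.)
cos_similarity = (tfidf_corpus @ tfidf_cluster.T).toarray()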
avg_similarity = np.mean(cos_similarity, axis=1)

cos_similarity
Out[271]: 
array([[0.        , 0.        , 0.        , 0.        ],
       [0.31452723, 0.36145869, 0.        , 0.43855558],
       [0.50673521, 0.8242027 , 0.7139548 , 0.70655744]])

avg_similarity
Out[272]: array([0.        , 0.27863537, 0.68786254])
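
On the centroid part of the question: because averaging is linear, the mean of the per-document cosine similarities is the same number you get by taking the dot product of each corpus row with the centroid (row mean) of the L2-normalized cluster matrix. The sketch below checks this on the same toy data. It uses `cosine_similarity` from `sklearn.metrics.pairwise` as a cross-check; the variable names `centroid` and `centroid_score` are mine, not from the answer.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

cluster = ['Line a baking pan with a sheet of parchment paper.',
           'Line the cake pan with parchment paper.',
           'Line the bottom with parchment paper.',
           'Line a baking pan with parchment paper.']

corpus = ['Add vinegar and sugar.',
          'Remove pan from heat and let stand 5 minutes.',
          'Line the pan with parchment paper.']

# Fit on the cluster, then map the corpus into the cluster's vocabulary
tfidf = TfidfVectorizer()  # default norm='l2' gives unit-length rows
tfidf_cluster = tfidf.fit_transform(cluster)
tfidf_corpus = tfidf.transform(corpus)

# Route 1: pairwise cosine similarity, then average over the cluster documents
pairwise = cosine_similarity(tfidf_corpus, tfidf_cluster)
avg_similarity = pairwise.mean(axis=1)

# Route 2: one dot product against the centroid of the normalized cluster rows
centroid = np.asarray(tfidf_cluster.mean(axis=0)).ravel()
centroid_score = np.asarray(tfidf_corpus @ centroid).ravel()

print(avg_similarity)
print(centroid_score)
```

Both routes should print the same three scores as `avg_similarity` above. Note that the centroid here is a mean of unit-length rows and is not itself re-normalized; re-normalizing it would rescale every score by the same factor and leave the ranking unchanged.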

【Discussion】:

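My real corpus is large, and I want to use the cluster as a query to rank the corpus documents. After tfidf the corpus can be extremely high-dimensional, so I plan to use LSA to reduce the dimensionality. However, I am stuck on which side the LSA model should be fit on. Since I want to rank the corpus using the cluster, and the TFIDF model is fit on the cluster, I may need to fit LSA on the cluster as well. But the cluster sometimes has only 2-3 dimensions, which may point toward fitting on the corpus side instead. Now I think I am completely confused...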
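
For the LSA scenario in the comment above, one possible arrangement (an assumption on my part, not something the answer prescribes) is to fit both TF-IDF and the SVD on the large corpus, project the small cluster into that reduced space, and average the cosine similarities there. A minimal sketch with scikit-learn's `TruncatedSVD`; `n_components=2` is chosen only because this toy corpus has three documents.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

cluster = ['Line a baking pan with a sheet of parchment paper.',
           'Line the cake pan with parchment paper.',
           'Line the bottom with parchment paper.',
           'Line a baking pan with parchment paper.']

corpus = ['Add vinegar and sugar.',
          'Remove pan from heat and let stand 5 minutes.',
          'Line the pan with parchment paper.']

# Assumption: the corpus is the larger side, so vocabulary and LSA are fit on it
tfidf = TfidfVectorizer()
corpus_tfidf = tfidf.fit_transform(corpus)
cluster_tfidf = tfidf.transform(cluster)

# Reduce dimensionality; a real corpus would use a much larger n_components
lsa = TruncatedSVD(n_components=2, random_state=0)
corpus_lsa = lsa.fit_transform(corpus_tfidf)
cluster_lsa = lsa.transform(cluster_tfidf)

# Rows are no longer unit length after the SVD, so normalize explicitly
lsa_similarity = cosine_similarity(corpus_lsa, cluster_lsa)
lsa_avg_similarity = lsa_similarity.mean(axis=1)

# Corpus document indices, best match first
ranking = np.argsort(-lsa_avg_similarity)
print(lsa_avg_similarity, ranking)
```

With so little data the reduced space is not meaningful, so treat the printed scores as a wiring check only. Words that appear in the cluster but not in the corpus are dropped by `tfidf.transform`, which is the trade-off of fitting on the corpus side.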
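Your question says you want to compare each document in the corpus with the cluster. So in the answer, tfidf is fit on the cluster, and the corpus is then mapped (transformed) into the cluster's space. The transformed corpus has the same number of columns (words) as your cluster, and its number of rows corresponds to the number of documents in your corpus. If you do the opposite and call tfidf.fit on the corpus, the tfidf will have more columns (words), and the number of rows will again equal the number of documents. I don't see any advantage in that.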
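
The shape argument in the comment above can be checked directly on the toy data from the question. This is a small sketch; the names `fit_on_cluster` and `fit_on_corpus` are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

cluster = ['Line a baking pan with a sheet of parchment paper.',
           'Line the cake pan with parchment paper.',
           'Line the bottom with parchment paper.',
           'Line a baking pan with parchment paper.']

corpus = ['Add vinegar and sugar.',
          'Remove pan from heat and let stand 5 minutes.',
          'Line the pan with parchment paper.']

# Vocabulary learned from the cluster: both matrices share its columns
fit_on_cluster = TfidfVectorizer().fit(cluster)
cluster_matrix = fit_on_cluster.transform(cluster)
corpus_in_cluster_space = fit_on_cluster.transform(corpus)
print(cluster_matrix.shape, corpus_in_cluster_space.shape)

# Vocabulary learned from the corpus: more columns, same number of corpus rows
fit_on_corpus = TfidfVectorizer().fit(corpus)
corpus_in_own_space = fit_on_corpus.transform(corpus)
print(corpus_in_own_space.shape)
```

The default tokenizer ignores single-character tokens such as "a" and "5", so they do not count as columns.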