Getting the top-K relevant documents from a similarity numpy.ndarray

【Question Title】: Getting the top-K relevant documents from a similarity numpy.ndarray
【Posted】: 2013-06-07 22:09:24
【Question】:

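I am using the document similarity defined here.

My question is how to get the most relevant documents from the resulting numpy.ndarray. Is there a way to sort a numpy array and get the top K most similar documents?

Here is the sample code.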

from sklearn.feature_extraction.text import TfidfVectorizer

poem = ["All the world's a stage",
"And all the men and women merely players",
"They have their exits and their entrances",
"And one man in his time plays many parts",
"His acts being seven ages. At first, the infant",
"Mewling and puking in the nurse's arms",
"And then the whining school-boy, with his satchel",
"And shining morning face, creeping like snail",
"Unwillingly to school. And then the lover",
"Sighing like furnace, with a woeful ballad",
"Made to his mistress' eyebrow. Then a soldier",
"Full of strange oaths and bearded like the pard",
"Jealous in honour, sudden and quick in quarrel",
"Seeking the bubble reputation",
"Even in the cannon's mouth. And then the justice",
"In fair round belly with good capon lined",
"With eyes severe and beard of formal cut",
"Full of wise saws and modern instances",
"And so he plays his part. The sixth age shifts",
"Into the lean and slipper'd pantaloon",
"With spectacles on nose and pouch on side",
"His youthful hose, well saved, a world too wide",
"For his shrunk shank; and his big manly voice",
"Turning again toward childish treble, pipes",
"And whistles in his sound. Last scene of all",
"That ends this strange eventful history",
"Is second childishness and mere oblivion",
"Sans teeth, sans eyes, sans taste, sans everything"]


vect = TfidfVectorizer(min_df=1)
tfidf = vect.fit_transform(poem) 

# TF-IDF rows are L2-normalized by default, so the dot product is the cosine similarity
result = (tfidf @ tfidf.T).toarray()

print(type(result))

print(result)

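【Comments】:

Tags: python numpy scikit-learn cosine-similarity

【Solution 1】:

Set the diagonal elements to zero (each document's similarity with itself would otherwise dominate the ranking), then use argsort() to find the top-K indices in the flattened array, and use unravel_index() to convert those 1-D indices into 2-D indices: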

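import numpy as np

# Zero the diagonal so a document's similarity with itself is never selected
result[np.diag_indices_from(result)] = 0.0
# Flattened indices of the 10 largest entries, in ascending order of similarity
idx = np.argsort(result, axis=None)[-10:]
# Convert the flat indices back to (row, column) index arrays
midx = np.unravel_index(idx, result.shape)
print(midx)
print(result[midx])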

Result:

(array([ 8, 14, 1, 0, 11, 17, 8, 10, 6, 8]), array([14, 8, 0, 1, 17, 11, 10, 8, 8, 6]))
[ 0.2329741 0.2329741 0.2379527 0.2379527 0.25723394 0.25723394 0.26570327 0.26570327 0.34954834 0.34954834]
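Because the similarity matrix is symmetric, every pair shows up twice in the output above (for example 8/14 and 14/8). If only unique pairs are wanted, one possible variation is to rank just the strict upper triangle. The following is a minimal sketch, not part of the original answer; the helper name top_k_pairs and the toy matrix are made up for illustration.

```python
import numpy as np

def top_k_pairs(sim, k):
    """Return the k most similar (i, j, score) pairs with i < j (hypothetical helper)."""
    sim = np.asarray(sim)
    # Strict upper triangle: each unordered pair appears once and the
    # diagonal (self-similarity) is excluded, so no zeroing is needed.
    rows, cols = np.triu_indices(sim.shape[0], k=1)
    scores = sim[rows, cols]
    # Sort by similarity, highest first, and keep the first k entries.
    order = np.argsort(scores)[::-1][:k]
    return [(int(rows[i]), int(cols[i]), float(scores[i])) for i in order]

# Toy symmetric similarity matrix with made-up values.
toy = np.array([[1.0, 0.2, 0.7, 0.1],
                [0.2, 1.0, 0.3, 0.9],
                [0.7, 0.3, 1.0, 0.4],
                [0.1, 0.9, 0.4, 1.0]])
pairs = top_k_pairs(toy, 2)
print(pairs)
```

Unlike the in-place assignment to the diagonal, this version leaves the input matrix unmodified.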

【Discussion】:
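The question could also be read as asking for the K most similar documents to each individual document, rather than the K best pairs overall. Under that assumption, sorting along each row instead of over the flattened array may be closer to what is needed. A rough sketch follows; top_k_per_document is a hypothetical name, and the toy matrix holds made-up values.

```python
import numpy as np

def top_k_per_document(sim, k):
    """For each row, return indices and scores of its k nearest documents (hypothetical helper)."""
    # Work on a float copy so the caller's matrix is left untouched.
    sim = np.array(sim, dtype=float)
    # A document should never be returned as its own neighbour.
    np.fill_diagonal(sim, -np.inf)
    # argsort is ascending, so take the last k columns and reverse them
    # to get the most similar document first.
    idx = np.argsort(sim, axis=1)[:, -k:][:, ::-1]
    scores = np.take_along_axis(sim, idx, axis=1)
    return idx, scores

# Toy symmetric similarity matrix with made-up values.
toy = np.array([[1.0, 0.2, 0.7, 0.1],
                [0.2, 1.0, 0.3, 0.9],
                [0.7, 0.3, 1.0, 0.4],
                [0.1, 0.9, 0.4, 1.0]])
idx, scores = top_k_per_document(toy, 2)
print(idx)
print(scores)
```

Row i of idx lists the neighbours of document i. For large matrices, np.argpartition can avoid a full sort of every row, at the cost of returning the top K in arbitrary order.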