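【Posted】:2020-12-09 01:15:56
【Question】:
How can I count and score several word lists against a corpus of multiple documents, so that the results can be sorted in a few different ways?
- Find the documents in the corpus that contain words from a list, and rank them by how strongly they match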
sort by most red
'i ate a red apple.'
'the kid read the book the little red riding hood',
- Also be able to find the documents closest to a given document.
most similar to doc 0
'i ate a red apple.'
'the kid read the book the little red riding hood',
For example:
colors = ['red', 'blue', 'yellow', 'purple']
things = ['apple', 'pickle', 'tomato', 'rainbow', 'book']
corpus = [
    'i ate a red apple.',
    'There are so many colors in the rainbow.',
    'the monster was purple and green.',
    'the pickle is very green',
    'the kid read the book the little red riding hood',
    'in the book the wizard of oz there was a yellow brick road.',
    'tom has a green thumb and likes working in a garden.',
]
0 1 2 3 4 5 6
I build a counter for each document:
# 0 'i ate a red apple.'
{'red': 1, 'blue': 0, 'yellow': 0, 'purple': 0}
{'apple': 1, 'pickle': 0, 'tomato': 0, 'rainbow': 0, 'book': 0}
# 1 'There are so many colors in the rainbow.'
{'red': 0, 'blue': 0, 'yellow': 0, 'purple': 0}
{'apple': 0, 'pickle': 0, 'tomato': 0, 'rainbow': 1, 'book': 0}
# 2 'the monster was purple and green.'
{'red': 0, 'blue': 0, 'yellow': 0, 'purple': 1}
{'apple': 0, 'pickle': 0, 'tomato': 0, 'rainbow': 0, 'book': 0}
# 3 'the pickle is very green'
{'red': 0, 'blue': 0, 'yellow': 0, 'purple': 0}
{'apple': 0, 'pickle': 1, 'tomato': 0, 'rainbow': 0, 'book': 0}
# 4 'the kid read the book the little red riding hood'
{'red': 1, 'blue': 0, 'yellow': 0, 'purple': 0}
{'apple': 0, 'pickle': 0, 'tomato': 0, 'rainbow': 0, 'book': 1}
# 5 'in the book the wizard of oz there was a yellow brick road.'
{'red': 0, 'blue': 0, 'yellow': 1, 'purple': 0}
{'apple': 0, 'pickle': 0, 'tomato': 0, 'rainbow': 0, 'book': 1}
# 6 'tom has a green thumb and likes working in a garden.'
{'red': 0, 'blue': 0, 'yellow': 0, 'purple': 0}
{'apple': 0, 'pickle': 0, 'tomato': 0, 'rainbow': 0, 'book': 0}
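The per-document counters above could be produced along these lines. This is only a minimal sketch, not an established recipe: `tokenize` and `count_list_words` are hypothetical helper names, and the tokenizer (lowercase, letters only) is an assumption that happens to suit the toy corpus.

```python
import re
from collections import Counter

colors = ['red', 'blue', 'yellow', 'purple']
things = ['apple', 'pickle', 'tomato', 'rainbow', 'book']
corpus = [
    'i ate a red apple.',
    'There are so many colors in the rainbow.',
    'the monster was purple and green.',
    'the pickle is very green',
    'the kid read the book the little red riding hood',
    'in the book the wizard of oz there was a yellow brick road.',
    'tom has a green thumb and likes working in a garden.',
]


def tokenize(text):
    """Lowercase the text and keep runs of letters (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def count_list_words(doc, word_list):
    """Return {word: count} for every word in word_list, zeros included."""
    counts = Counter(tokenize(doc))
    # A Counter returns 0 for missing keys, so absent words show up as 0.
    return {word: counts[word] for word in word_list}


for i, doc in enumerate(corpus):
    print('#', i, repr(doc))
    print(count_list_words(doc, colors))
    print(count_list_words(doc, things))
```

Because whole tokens are compared, 'read' in document 4 is not counted as 'red'.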
Or one array for colors and one array for things (rows are words, columns are document indices):
# colors
        0 1 2 3 4 5 6
red     1 0 0 0 1 0 0
blue    0 0 0 0 0 0 0
yellow  0 0 0 0 0 1 0
purple  0 0 1 0 0 0 0
# things
        0 1 2 3 4 5 6
apple   1 0 0 0 0 0 0
pickle  0 0 0 1 0 0 0
tomato  0 0 0 0 0 0 0
rainbow 0 1 0 0 0 0 0
book    0 0 0 0 1 1 0
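The same counts can be laid out as word-by-document arrays. The sketch below uses only the standard library; `count_matrix` and `print_matrix` are hypothetical names, and the simple letters-only tokenizer is again an assumption.

```python
import re
from collections import Counter

colors = ['red', 'blue', 'yellow', 'purple']
things = ['apple', 'pickle', 'tomato', 'rainbow', 'book']
corpus = [
    'i ate a red apple.',
    'There are so many colors in the rainbow.',
    'the monster was purple and green.',
    'the pickle is very green',
    'the kid read the book the little red riding hood',
    'in the book the wizard of oz there was a yellow brick road.',
    'tom has a green thumb and likes working in a garden.',
]


def tokenize(text):
    """Lowercase the text and keep runs of letters (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def count_matrix(word_list, docs):
    """Map each word to a list of its counts, one entry per document."""
    doc_counts = [Counter(tokenize(doc)) for doc in docs]
    return {word: [counts[word] for counts in doc_counts] for word in word_list}


def print_matrix(title, matrix, n_docs):
    """Print the matrix with words as rows and document indices as columns."""
    width = max(len(word) for word in matrix) + 1
    print('#', title)
    print(' ' * width + ' '.join(str(i) for i in range(n_docs)))
    for word, row in matrix.items():
        print(word.ljust(width) + ' '.join(str(n) for n in row))


color_matrix = count_matrix(colors, corpus)
thing_matrix = count_matrix(things, corpus)
print_matrix('colors', color_matrix, len(corpus))
print_matrix('things', thing_matrix, len(corpus))
```

Keeping one row per word makes "sort by most red" a lookup of a single row, while a column gives the profile of a single document.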
Then find the most similar documents, or sort by the closest counts:
sort by most red
'i ate a red apple.'
'the kid read the book the little red riding hood',
most similar to doc 0
'i ate a red apple.'
'the kid read the book the little red riding hood',
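Both orderings above can be sketched from the count vectors. Cosine similarity is used here as one reasonable choice of closeness; the question itself does not name a measure, so that is an assumption, as are the helper names `vectorize`, `sort_by_word`, and `most_similar`.

```python
import math
import re
from collections import Counter

colors = ['red', 'blue', 'yellow', 'purple']
things = ['apple', 'pickle', 'tomato', 'rainbow', 'book']
corpus = [
    'i ate a red apple.',
    'There are so many colors in the rainbow.',
    'the monster was purple and green.',
    'the pickle is very green',
    'the kid read the book the little red riding hood',
    'in the book the wizard of oz there was a yellow brick road.',
    'tom has a green thumb and likes working in a garden.',
]
vocab = colors + things  # only words from the lists are scored


def vectorize(doc):
    """Count vector of the document over the list words only."""
    counts = Counter(re.findall(r"[a-z]+", doc.lower()))
    return [counts[word] for word in vocab]


vectors = [vectorize(doc) for doc in corpus]


def cosine(a, b):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sort_by_word(word):
    """Indices of documents containing the word, highest count first."""
    col = vocab.index(word)
    ranked = sorted(range(len(corpus)), key=lambda i: vectors[i][col], reverse=True)
    return [i for i in ranked if vectors[i][col] > 0]


def most_similar(doc_index):
    """(index, score) pairs with non-zero similarity, best match first."""
    scored = [(i, cosine(vectors[doc_index], vec)) for i, vec in enumerate(vectors)]
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [(i, score) for i, score in ranked if score > 0]


print('sort by most red')
for i in sort_by_word('red'):
    print(repr(corpus[i]))

print('most similar to doc 0')
for i, score in most_similar(0):
    print(repr(corpus[i]), round(score, 3))
```

On the toy corpus both orderings return documents 0 and 4, in that order. With raw counts, longer documents can dominate a "most red" sort, so dividing by document length or using TF-IDF weights may be worth trying on real data.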
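Or should I use doc2vec, or something else entirely?
【Comments】:
-
Just to clarify: do you want the similarity between two documents to be based only on the colors and things? Or do you just want similar sentences based on the co-occurrence of all words? Or do you want sentences that are similar by context (colors, crayons, blue, etc. share a similar context, compared with apple, banana, fruit, salad)?
-
Each document as a whole is just a toy example. In real use, the lists will hold 10 or so words that represent a mood, a topic, and so on, such as happy, sad, or whatever. I am trying to count matches in order to find how similar each document in the corpus is and to rank it against the word lists.
-
So you have a large set of topics, and you are finding similarities based on the words in those specific topics? So one topic set could be Mood, and you want to find the sentences whose mood is angry?
-
Yes, but sentiment analysis has already been done; this is for eventual use with collaborative filtering. I have a large number of reviews, and the goal is to find similarities based on the words in these specific word lists. I want to classify people by the similarity of their documents (each person has a document or a column); a mood is just a list of specific words. So I can sort by most similar to list1 or list2, or by most similar to doc1 or to a person's name.
-
Check my answer; your question contains several variations of the same problem. All of them are modifications of the same approach.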