【Question Title】: Find and sort the documents in a corpus most similar to a list of specific words
【Posted】: 2020-12-09 01:15:56
【Question Description】:

How can I count and score several word lists against a corpus of documents, so that I can sort the results in a few different ways?

  1. Find documents in the corpus and sort them by similarity to the words in a list
sort by most red
'i ate a red apple.'
'the kid read the book the little red riding hood', 
  2. Be able to find the documents closest to a given document
most similar to doc 0
'i ate a red apple.'
'the kid read the book the little red riding hood', 

For example:

colors  = ['red', 'blue', 'yellow' , 'purple']
things = ['apple', 'pickle', 'tomato' , 'rainbow', 'book']

corpus = ['i ate a red apple.', 'There are so many colors in the rainbow.', 'the monster was purple and green.', 'the pickle is very green', 'the kid read the book the little red riding hood', 'in the book the wizard of oz there was a yellow brick road.', 'tom has a green thumb and likes working in a garden.' ]

# corpus document indexes
     0    1    2    3    4    5    6

I make a counter:

# 0 'i ate a red apple.'
{'red': 1, 'blue': 0, 'yellow': 0, 'purple': 0}
{'apple': 1, 'pickle': 0, 'tomato': 0, 'rainbow': 0, 'book': 0}

# 1 'There are so many colors in the rainbow.'
{'red': 0, 'blue': 0, 'yellow': 0, 'purple': 0}
{'apple': 0, 'pickle': 0, 'tomato': 0, 'rainbow': 1, 'book': 0}

# 2 'the monster was purple and green.'
{'red': 0, 'blue': 0, 'yellow': 0, 'purple': 1}
{'apple': 0, 'pickle': 0, 'tomato': 0, 'rainbow': 0, 'book': 0}

# 3 'the pickle is very green'
{'red': 0, 'blue': 0, 'yellow': 0, 'purple': 0}
{'apple': 0, 'pickle': 1, 'tomato': 0, 'rainbow': 0, 'book': 0}

# 4 'the kid read the book the little red riding hood'
{'red': 1, 'blue': 0, 'yellow': 0, 'purple': 0}
{'apple': 0, 'pickle': 0, 'tomato': 0, 'rainbow': 0, 'book': 1}

# 5 'in the book the wizard of oz there was a yellow brick road.'
{'red': 0, 'blue': 0, 'yellow': 1, 'purple': 0}
{'apple': 0, 'pickle': 0, 'tomato': 0, 'rainbow': 0, 'book': 1}

# 6 'tom has a green thumb and likes working in a garden.'
{'red': 0, 'blue': 0, 'yellow': 0, 'purple': 0}
{'apple': 0, 'pickle': 0, 'tomato': 0, 'rainbow': 0, 'book': 0}
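The counters above can be sketched with `collections.Counter` (a minimal illustration over two of the documents; `count_keywords` is a hypothetical helper, not part of any answer below):

```python
from collections import Counter
import re

colors = ['red', 'blue', 'yellow', 'purple']
corpus = ['i ate a red apple.',
          'the kid read the book the little red riding hood']

def count_keywords(doc, keywords):
    """Count how often each keyword occurs in one document."""
    tokens = Counter(re.findall(r"[a-z]+", doc.lower()))
    return {w: tokens[w] for w in keywords}

print(count_keywords(corpus[0], colors))
# {'red': 1, 'blue': 0, 'yellow': 0, 'purple': 0}

# sort the corpus by "most red"
ranked = sorted(corpus, key=lambda d: count_keywords(d, ['red'])['red'], reverse=True)
```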

Or a colors array and a things array:

# colors
         0    1    2    3    4    5    6
red      1    0    0    0    1    0    0
blue     0    0    0    0    0    0    0
yellow   0    0    0    0    0    1    0
purple   0    0    1    0    0    0    0
# things
          0    1    2    3    4    5    6
apple     1    0    0    0    0    0    0
pickle    0    0    0    1    0    0    0
tomato    0    0    0    0    0    0    0
rainbow   0    1    0    0    0    0    0
book      0    0    0    0    1    1    0

Then find the most similar, or sort by the closest counts:

sort by most red
'i ate a red apple.'
'the kid read the book the little red riding hood', 
most similar to doc 0
'i ate a red apple.'
'the kid read the book the little red riding hood', 

Or should I use doc2vec, or something else entirely?

【Comments】:

  • Just to clarify, do you want the similarity between two documents based only on the colors and things? Or do you want similar sentences based on the co-occurrence of all words? Or do you want similar sentences based on context (colors, crayon, blue, etc. share a context, compared to apple, banana, fruit, salad)?
  • Each document here is just a toy example. In real use they will be lists of 10 or so words representing moods, topics, etc., e.g. happy, sad, or whatever. I'm trying to count occurrences so I can score each document in the corpus for similarity to a word list and sort by that.
  • So you have a large set of topics, and you're finding similarity based on the words in those specific topics? A topic set might be Mood, and you want to find sentences whose mood is angry?
  • Yes, but the sentiment analysis is already done; this is ultimately for collaborative filtering. I have a large number of reviews, and the goal is to find similarity based on the words in these specific word lists. I want to classify people by the similarity of their documents (each person has a document, or a column); a mood is just a list of specific words. So I can sort by most similar to list1 or list2, or most similar to doc1 or a person's name.
  • Check my answer; your question contains several variations of the same problem, and all of them are small modifications of the same approach.

Tags: python pandas nlp


【Solution 1】:

IIUC, you have a bunch of topics such as colors, things, moods, etc., each with some keywords, and you want to find the similarity between sentences based on the occurrence of the keywords in a given topic.

You can do this in a few steps -

  1. Fit a count vectorizer to get the occurrence counts of all unique words
  2. Filter it down to only the keywords present in the topic
  3. Take the dot product of that topic's word occurrences, (sentences * topic words) dot (topic words * sentences), to get a (sentences * sentences) matrix, which is the same as the (non-normalized) cosine similarity between 2 sentences for that topic
  4. Go to a specific row and get the sentences with the highest similarity scores in that row (other than the sentence itself)
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
out = cv.fit_transform(corpus).toarray() #apply CountVectorizer

#For scalability (you may have many more topics, like Mood), combine all topics first and filter by a given topic later
combined = colors+things  #combine all your topics

c = [(k,v) for k,v in cv.vocabulary_.items() if k in combined] #get vocabulary indexes for the items from all topics

cdf = pd.DataFrame(out[:,[i[1] for i in c]], columns=[i[0] for i in c]).T  #filter the count matrix down to those items

print(cdf)
#This results in a keyword-occurrence dataset with all keywords from all topics
         0  1  2  3  4  5  6
red      1  0  0  0  1  0  0
apple    1  0  0  0  0  0  0
rainbow  0  1  0  0  0  0  0
purple   0  0  1  0  0  0  0
pickle   0  0  0  1  0  0  0
book     0  0  0  0  1  1  0
yellow   0  0  0  0  0  1  0

Now, for the next step, filter it by topic (colors or things, etc.) and take the cosine similarity (normalized dot product) of that matrix. This can be done with this function -

def get_similarity_table(topic):
    df = cdf.loc[cdf.index.isin(topic)]  #filter by topic
    cnd = df.values
    similarity = cnd.T@cnd #Take dot product to get similarity matrix
    dd = pd.DataFrame(similarity, index=corpus, columns=corpus) #convert to a dataframe
    return dd

get_similarity_table(things)

If you look at a single row of this table, the column with the highest value is the most similar sentence. So if you want the single most similar sentence, just take the max; if you want the top 5, sort the row and take the top 5 values (and their corresponding columns).
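A top-n lookup over such a similarity table could be sketched as follows (this rebuilds the count matrix from the corpus; `top_similar` is an illustrative helper, not part of the answer's code):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

colors = ['red', 'blue', 'yellow', 'purple']
things = ['apple', 'pickle', 'tomato', 'rainbow', 'book']
corpus = ['i ate a red apple.', 'There are so many colors in the rainbow.',
          'the monster was purple and green.', 'the pickle is very green',
          'the kid read the book the little red riding hood',
          'in the book the wizard of oz there was a yellow brick road.',
          'tom has a green thumb and likes working in a garden.']

# keyword-occurrence matrix (keywords x sentences), as in the answer
cv = CountVectorizer()
out = cv.fit_transform(corpus).toarray()
c = [(k, v) for k, v in cv.vocabulary_.items() if k in colors + things]
cdf = pd.DataFrame(out[:, [i[1] for i in c]], columns=[i[0] for i in c]).T

def top_similar(sentence, topic, n=5):
    df = cdf.loc[cdf.index.isin(topic)]              # filter by topic
    sim = pd.DataFrame(df.values.T @ df.values,      # (sentences x sentences)
                       index=corpus, columns=corpus)
    row = sim.loc[sentence].drop(sentence)           # exclude the sentence itself
    return row.sort_values(ascending=False).head(n)

print(top_similar('i ate a red apple.', colors))
```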

Here is the code to get the sentence most similar to a given sentence:

import numpy as np

def get_similar_review(s, topic):
    df = cdf.loc[cdf.index.isin(topic)] #filter by topic
    cnd = df.values
    similarity = cnd.T@cnd #Take dot product to get similarity matrix
    np.fill_diagonal(similarity,0) #set diagonal elements to 0, to avoid the same sentence being returned as output
    dd = pd.DataFrame(similarity, index=corpus, columns=corpus) #convert to a dataframe
    return dd.loc[s].idxmax() #filter by sentence and get the column name with the max value

s = 'i ate a red apple.'
get_similar_review(s, colors)
#'the kid read the book the little red riding hood'

s = 'the kid read the book the little red riding hood'
get_similar_review(s, things)
#'in the book the wizard of oz there was a yellow brick road.'

If you don't want to compute similarity per topic, you can simply skip most of the steps, take the CountVectorizer matrix directly, and take its dot product to get the (sentences * sentences) similarity matrix.
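A sketch of this all-words variant, using `cosine_similarity` from scikit-learn so the scores are normalized (an assumption on my part; the answer's unnormalized dot product works the same way):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ['i ate a red apple.', 'There are so many colors in the rainbow.',
          'the monster was purple and green.', 'the pickle is very green',
          'the kid read the book the little red riding hood',
          'in the book the wizard of oz there was a yellow brick road.',
          'tom has a green thumb and likes working in a garden.']

X = CountVectorizer().fit_transform(corpus)  # (sentences x all words)
sim = cosine_similarity(X)                   # (sentences x sentences), 1.0 on the diagonal
print(sim.shape)
```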

【Discussion】:

    【Solution 2】:

    You can achieve this by iterating over each line and grouping by word to get the counts:

    def words_counter(corpus_parameter, colors_par, things_par):
        """ Returns two dataframes with the occurrence of the words in colors_par & things_par
        corpus_parameter: list of strings, common language
        colors_par: list of words with no spaces or punctuation
        things_par: list of words with no spaces or punctuation
        """
        colors_count, things_count = [], [] # lists to collect intermediate series
        for line in corpus_parameter:
            words = pd.Series(
                line
                .strip(' !?.') # it will remove any spaces or punctuation from left/right of the string
                .lower() # use this to count 'red', 'Red', and 'RED' as the same word
                .split() # split using spaces (' ') by default, you can provide a different character
            ) # returns a clean series with all the words
            # print(words) # uncomment to see the series
            words = words.groupby(words).size() # returns the words as index and the count as values
            # print(words) # uncomment to see the series
            colors_count.append(words.loc[words.index.isin(colors_par)])
            things_count.append(words.loc[words.index.isin(things_par)])
            
        colors_count = (
            pd.concat(colors_count, axis=1) # convert list of series to dataframe
            .reindex(colors_par) # include colors with zero occurrence
            .fillna(0) # get rid of NaNs
            .astype(int) # convert from default float to integer
        )
        things_count = pd.concat(things_count, axis=1).reindex(things_par).fillna(0).astype(int)
            
        print(colors_count)
        print(things_count)
        return colors_count, things_count
    

    Call it with:

    words_counter(corpus, colors, things)
    

    Output:

            0  1  2  3  4  5  6
    red     1  0  0  0  1  0  0
    blue    0  0  0  0  0  0  0
    yellow  0  0  0  0  0  1  0
    purple  0  0  1  0  0  0  0
    
             0  1  2  3  4  5  6
    apple    1  0  0  0  0  0  0
    pickle   0  0  0  1  0  0  0
    tomato   0  0  0  0  0  0  0
    rainbow  0  1  0  0  0  0  0
    book     0  0  0  0  1  1  0
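As a usage sketch, the same counts can also be built without an explicit loop and then used to rank documents by a word (a compact alternative under the same tokenization assumptions; `colors_count` here is rebuilt locally, not returned by the function above):

```python
import pandas as pd

colors = ['red', 'blue', 'yellow', 'purple']
corpus = ['i ate a red apple.', 'There are so many colors in the rainbow.',
          'the monster was purple and green.', 'the pickle is very green',
          'the kid read the book the little red riding hood',
          'in the book the wizard of oz there was a yellow brick road.',
          'tom has a green thumb and likes working in a garden.']

# strip edge punctuation, lowercase, and split each line into words
tokens = [line.strip(' !?.').lower().split() for line in corpus]

# one column per document, one row per color
colors_count = pd.DataFrame(
    {i: {w: t.count(w) for w in colors} for i, t in enumerate(tokens)}
).reindex(colors)

# rank documents by occurrences of 'red'
print(colors_count.loc['red'].sort_values(ascending=False))
```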
    

    【Discussion】:

    • Thanks for the help; some of these methods are much faster. I learned a lot, and I really appreciate it.