【Question title】: How to write a method that returns cosine similarity between two documents
【Posted】: 2021-08-17 21:34:43
【Question】:

I'm writing a method that returns the cosine similarity between two documents. Using sklearn's CountVectorizer(), I tried:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def doc_cos_similar(doc1:str, doc2:str) -> float:
  vectorizer= CountVectorizer()
  doc1="Good morning"
  doc2="Good evening"
  documents = [doc1, doc2]
  count_vectorizer = CountVectorizer()
  sparse_matrix = count_vectorizer.fit_transform(documents)
  doc_term_matrix = sparse_matrix.todense()
  return doc_term_matrix

# Input

doc1="Good morning"
doc2="Good afternoon"

The output should be something like 0.60.

But the actual output is a matrix:

matrix([[0, 1, 1], [1, 1, 0]])

【Comments】:

    Tags: python-3.x nlp cosine-similarity countvectorizer


    【Solution 1】:

    You're almost there.

    cosine_similarity(doc_term_matrix) returns

    array([[1. , 0.5],
           [0.5, 1. ]])
    

    So you can use cosine_similarity(doc_term_matrix)[0][1] (or [1][0]; it doesn't matter, because cosine similarity is symmetric).
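To see where the 0.5 comes from, the off-diagonal value can be reproduced by hand from the two rows of the document-term matrix shown in the question. This is a minimal sketch in plain Python; the helper name `cosine` is made up for illustration and is not part of sklearn.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Rows of the document-term matrix from the question
row1 = [0, 1, 1]
row2 = [1, 1, 0]

# One shared word, two words per document: 1 / (sqrt(2) * sqrt(2))
similarity = cosine(row1, row2)
print(similarity)
```

The two documents share one word out of two each, so the result is approximately 0.5, not the 0.60 guessed in the question.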

    P.S. You should pass doc1 and doc2 in as arguments rather than hard-coding them inside the function.
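Putting both points together, the function from the question might end up looking like the sketch below. It assumes scikit-learn is installed and keeps the asker's function name; converting the result with `float()` is an added convenience, not something the answer prescribes.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def doc_cos_similar(doc1: str, doc2: str) -> float:
    """Return the cosine similarity between the word-count vectors of two documents."""
    # Build one vocabulary over both documents and count word occurrences
    doc_term_matrix = CountVectorizer().fit_transform([doc1, doc2])
    # cosine_similarity returns a 2x2 matrix; [0][1] compares doc1 with doc2
    return float(cosine_similarity(doc_term_matrix)[0][1])

print(doc_cos_similar("Good morning", "Good evening"))
```

For the inputs in the question this gives a value close to 0.5. Note that CountVectorizer raises a ValueError when the vocabulary ends up empty (for example, if both documents are empty strings), so callers may want to handle that case.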

    【Comments】:

      【Solution 2】:

      You can try this:

      from nltk.corpus import stopwords
      from nltk.tokenize import word_tokenize
      
      # X = input("Enter first string: ").lower()
      # Y = input("Enter second string: ").lower()
      X ="Good morning! Welcome"
      Y ="Good evening! Welcome"
      
      # tokenization
      X_list = word_tokenize(X)
      Y_list = word_tokenize(Y)
      
      # sw contains the English stopwords (the NLTK 'stopwords' corpus must be downloaded)
      sw = set(stopwords.words('english'))
      
      # remove stop words from each token list
      X_set = {w for w in X_list if w not in sw}
      Y_set = {w for w in Y_list if w not in sw}
      
      # build binary vectors over the union of both keyword sets
      rvector = X_set.union(Y_set)
      l1 = [1 if w in X_set else 0 for w in rvector]
      l2 = [1 if w in Y_set else 0 for w in rvector]
      
      # cosine formula: dot product divided by the product of the vector norms
      # (for 0/1 vectors, the squared norm equals the sum of the entries)
      c = sum(a * b for a, b in zip(l1, l2))
      cosine = c / float((sum(l1) * sum(l2)) ** 0.5)
      print("similarity: ", cosine)
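The same binary-vector idea can also be wrapped in a reusable function that does not divide by zero when a document has no keywords left. The sketch below is only an illustration: the name `set_cosine`, the `stop_words` parameter, and the regex tokenizer (which stands in for `word_tokenize` so the example runs without NLTK data) are all assumptions, not part of the answer above.

```python
import re

def set_cosine(doc1: str, doc2: str, stop_words: frozenset = frozenset()) -> float:
    """Cosine similarity between binary (word present / absent) vectors."""
    # Hypothetical tokenizer: lowercase runs of letters, standing in for word_tokenize
    tokens1 = set(re.findall(r"[a-z']+", doc1.lower())) - stop_words
    tokens2 = set(re.findall(r"[a-z']+", doc2.lower())) - stop_words
    if not tokens1 or not tokens2:
        return 0.0  # avoid dividing by zero when a document has no keywords
    # For 0/1 vectors the dot product is the number of shared words,
    # and each squared norm is the number of distinct words.
    return len(tokens1 & tokens2) / (len(tokens1) * len(tokens2)) ** 0.5

result = set_cosine("Good morning! Welcome", "Good evening! Welcome")
print(result)
```

Because this tokenizer drops punctuation and lowercases the text, the value it prints (2 shared words out of 3 per document, about 0.67) can differ from what the NLTK-based snippet reports for the same strings.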
      

      【Comments】:
