Title: TF-IDF in Python and undesired results
Posted: 2013-09-23 15:35:22
Question:

I found a Python tutorial online for computing tf-idf and cosine similarity. I am trying to use it with a few modifications.

The problem is that I get strange results that make almost no sense.

For example, I use three documents, [doc1, doc2, doc3], where doc1 and doc2 are similar and doc3 is completely different.

The results are here:

[[  0.00000000e+00   2.20351188e-01   9.04357868e-01]
 [  2.20351188e-01  -2.22044605e-16   8.82546765e-01]
 [  9.04357868e-01   8.82546765e-01  -2.22044605e-16]]

First, I thought the numbers on the main diagonal should be 1, not 0. Beyond that, doc1 and doc2 get a score of about 0.22, while doc1 and doc3 get about 0.90. I expected the opposite. Could you check my code and help me understand why I get these results?

doc1, doc2, and doc3 are tokenized texts.

import math
import numpy
import nltk

articles = [doc1, doc2, doc3]

corpus = []
for article in articles:
    for word in article:
        corpus.append(word)


def freq(word, article):
    return article.count(word)

def wordCount(article):
    return len(article)

def numDocsContaining(word, articles):
    count = 0
    for article in articles:
        if word in article:
            count += 1
    return count

def tf(word, article):
    return (freq(word,article) / float(wordCount(article)))

def idf(word, articles):
    return math.log(len(articles) / (1 + numDocsContaining(word,articles)))

def tfidf(word, document, documentList):
    return (tf(word,document) * idf(word,documentList))

feature_vectors=[]

for article in articles:
    vec=[]
    for word in corpus:
        if word in article:
            vec.append(tfidf(word, article, corpus))
        else:
            vec.append(0)
    feature_vectors.append(vec)

n=len(articles)

mat = numpy.empty((n, n))
for i in xrange(0,n):
    for j in xrange(0,n):
       mat[i][j] = nltk.cluster.util.cosine_distance(feature_vectors[i],feature_vectors[j])

print mat
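
One thing worth noting about the matrix: `nltk.cluster.util.cosine_distance` returns a *distance* (1 minus the cosine similarity), not a similarity, so a vector compared with itself gives ~0, not 1, and *smaller* numbers mean *more* similar. A minimal self-contained sketch of that formula (implemented here with plain numpy rather than importing nltk):

```python
import numpy

def cosine_distance(u, v):
    # the same formula nltk.cluster.util.cosine_distance uses:
    # 1 - cosine similarity
    u = numpy.asarray(u, dtype=float)
    v = numpy.asarray(v, dtype=float)
    return 1.0 - numpy.dot(u, v) / (numpy.linalg.norm(u) * numpy.linalg.norm(v))

# identical vectors: cosine similarity 1, so distance ~0 (the diagonal above)
print(cosine_distance([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
# orthogonal vectors: cosine similarity 0, so distance 1
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
```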

Comments:

    Tags: python similarity tf-idf


    Answer 1:

    If you are open to trying another package, such as sklearn, give it a try.

    This code may help:

    from nltk.corpus import stopwords
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    
    
    # read each document and decode it as UTF-8, ignoring undecodable bytes
    f = open("/root/Myfolder/scoringDocuments/doc1")
    doc1 = f.read().decode("UTF-8", "ignore")
    f.close()
    f = open("/root/Myfolder/scoringDocuments/doc2")
    doc2 = f.read().decode("UTF-8", "ignore")
    f.close()
    f = open("/root/Myfolder/scoringDocuments/doc3")
    doc3 = f.read().decode("UTF-8", "ignore")
    f.close()
    
    train_set = [doc1, doc2, doc3]
    
    test_set = ["age salman khan wife"] #Query 
    stopWords = stopwords.words('english')
    
    tfidf_vectorizer = TfidfVectorizer(stop_words = stopWords)
    tfidf_matrix_test =  tfidf_vectorizer.fit_transform(test_set)
    print tfidf_vectorizer.vocabulary_
    tfidf_matrix_train = tfidf_vectorizer.transform(train_set) #finds the tfidf score with normalization
    print 'Fit Vectorizer to train set', tfidf_matrix_train.todense()
    print 'Transform Vectorizer to test set', tfidf_matrix_test.todense()
    
    print "\n\ncosine similarity (query vs. train set) ==> ", cosine_similarity(tfidf_matrix_test, tfidf_matrix_train)
    

    Refer to this tutorial: part-I, part-II, part-III. It will help.

    Comments:

    • I have already tried this library. The problem is that I want to use my own functions to prepare the text (stopword removal and stemming).
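
    • That is still possible with sklearn: `TfidfVectorizer` accepts a `tokenizer` callable, so your own stopword removal and stemming can be plugged in. A sketch under that assumption — the stopword list and the crude strip-trailing-"s" "stemmer" below are placeholders for your real functions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

MY_STOPWORDS = set(["the", "a", "is"])  # placeholder list

def my_tokenizer(text):
    # lowercase, split on whitespace, drop stopwords, then apply a
    # stand-in "stemmer" (strip a trailing "s"); swap in your own steps
    tokens = [t.lower() for t in text.split()]
    tokens = [t for t in tokens if t not in MY_STOPWORDS]
    return [t.rstrip("s") for t in tokens]

# lowercase=False because my_tokenizer already lowercases
vectorizer = TfidfVectorizer(tokenizer=my_tokenizer, lowercase=False)
tfidf = vectorizer.fit_transform(["the cats sat", "a cat is sitting"])
print(sorted(vectorizer.vocabulary_))
```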