【Posted】: 2014-04-21 11:04:37
【Question】:
So, I copied some source code showing how to build a system that can run tf-idf. The code is as follows:
# module imports
from __future__ import division, unicode_literals
import math
import string
import re
import os
# in newer TextBlob releases: from textblob import TextBlob as tb
from text.blob import TextBlob as tb

# create an empty dict (not used below)
words = {}

def tf(word, blob):
    # term frequency: share of the document's words that are `word`
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    # number of documents in which `word` occurs
    return sum(1 for blob in bloblist if word in blob)

def idf(word, bloblist):
    # inverse document frequency, with 1 added to the denominator
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

regex = re.compile('[%s]' % re.escape(string.punctuation))

# document 1: a sport article (punctuation stripped, lowercased)
f = open('D:/article/sport/a.txt','r')
var = f.read()
var = regex.sub(' ', var)
var = var.lower()
document1 = tb(var)

# document 2: a food article (lowercased only)
f = open('D:/article/food/b.txt','r')
var = f.read()
var = var.lower()
document2 = tb(var)

bloblist = [document1, document2]
for i, blob in enumerate(bloblist):
    print("Top words in document {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:50]:
        print("Word: {}, TF-IDF: {}".format(word, round(score, 5)))
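However, the problem is that I want to put all the files in the sport folder into one corpus and the food articles in the food folder into another corpus, so that the system gives one result per corpus. Right now I can only compare individual files, but I want to compare between corpora. Sorry for asking this; any help would be greatly appreciated.
Thanks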
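To make the goal concrete, here is a minimal sketch of one way to do it: treat each folder as a single "document" by merging the word counts of all its files, then score words across the folders. It is an illustration under stated assumptions, not a definitive implementation. It assumes Python 3, UTF-8 `.txt` files, and plain whitespace tokenization instead of TextBlob. The helper names (`tokenize`, `load_corpus`, `corpus_tfidf`, `top_words`) are made up for this example. It also swaps in a smoothed idf, `log((1 + N) / (1 + df)) + 1`, instead of the formula in the code above: with only two corpora, `log(N / (1 + df))` is 0 for a word found in one corpus and negative for a word found in both, so no word would ever get a positive score.

```python
import io
import math
import os
import re
import string
from collections import Counter

PUNCT_RE = re.compile('[%s]' % re.escape(string.punctuation))


def tokenize(text):
    """Lowercase, replace punctuation with spaces, split on whitespace."""
    return PUNCT_RE.sub(' ', text.lower()).split()


def load_corpus(folder):
    """Merge every .txt file in `folder` into one Counter of word counts."""
    counts = Counter()
    for name in sorted(os.listdir(folder)):
        if name.endswith('.txt'):
            path = os.path.join(folder, name)
            with io.open(path, encoding='utf-8') as f:  # assumes UTF-8 files
                counts.update(tokenize(f.read()))
    return counts


def corpus_tfidf(corpora):
    """Score words per corpus.

    corpora: dict mapping corpus name -> Counter of word counts.
    Returns a dict mapping corpus name -> {word: score}.
    """
    n = len(corpora)
    # df[word] = number of corpora in which the word occurs at least once
    df = Counter()
    for counts in corpora.values():
        df.update(counts.keys())

    result = {}
    for name, counts in corpora.items():
        total = sum(counts.values())
        result[name] = {
            # tf * smoothed idf (an assumption, not the question's formula)
            word: (c / total) * (math.log((1 + n) / (1 + df[word])) + 1)
            for word, c in counts.items()
        }
    return result


def top_words(scores, k=50):
    """Return the k highest-scoring (word, score) pairs."""
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:k]


def main():
    # Folder paths taken from the question; adjust to your own layout.
    corpora = {
        'sport': load_corpus('D:/article/sport'),
        'food': load_corpus('D:/article/food'),
    }
    for name, scores in corpus_tfidf(corpora).items():
        print("Top words in corpus {}".format(name))
        for word, score in top_words(scores):
            print("Word: {}, TF-IDF: {}".format(word, round(score, 5)))
```

Calling `main()` prints one ranked list per folder. Adding a third folder only needs another entry in the `corpora` dict, and the more corpora there are, the more informative the idf part becomes.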
【Comments】:
-
I accidentally pressed the button :p