[Posted]: 2017-05-19 20:38:52
[Question]:
I am using NLTK to compute the tf_idf of words, but most of the scores come out as 0.
from nltk.text import TextCollection

# detect_lang and AnalyseFrenchText are helpers defined elsewhere (not shown).
def compute_tf_idf(corpus, source_text):
    # Build the reference corpus: one token list per French document.
    texts = []
    for text in corpus:
        if text['text'] is not None:
            try:
                language = detect_lang(text['text'])
            except Exception:
                language = None
            # French analysis
            if language == "french":
                french_analyser = AnalyseFrenchText(text['text'])
                french_analyser.analysetext()
                tokenized_text = french_analyser.get_tokenized_text()
                if tokenized_text is not None:
                    texts.append(tokenized_text)
    textCorpus = TextCollection(texts)
    for word in textCorpus[:100]:
        print(word)  # prints the words correctly
    # Tokenize the source text the same way, then score each of its words.
    try:
        language = detect_lang(source_text)
    except Exception:
        language = None
    # French analysis
    if language == "french":
        french_analyser = AnalyseFrenchText(source_text)
        french_analyser.analysetext()
        tokenized_source_text = french_analyser.get_tokenized_text()
        for word in tokenized_source_text:
            print(word)
            print("idf :" + str(textCorpus.idf(word)))
            print("tf : " + str(textCorpus.tf(word, tokenized_source_text)))
            print("tf_idf :" + str(textCorpus.tf_idf(word, tokenized_source_text)))
    return
Result:
Commande
idf :0.0
tf : 0.0024875621890547263
tf_idf :0.0
I checked the NLTK source used to compute idf:
""" The number of texts in the corpus divided by the
number of texts that the term appears in.
If a term does not appear in the corpus, 0.0 is returned. """
Am I using NLTK's tf_idf incorrectly? Thanks.
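For reference, here is a minimal pure-Python sketch of how an idf of 0.0 can arise. It assumes idf is the natural log of (number of texts / number of texts containing the term), with 0.0 returned when no text contains the term, which is how the quoted docstring reads once the usual log is applied. The `idf` and `tf` functions and the small `corpus` below are hypothetical stand-ins, not NLTK's own code or my real data.

```python
import math

def idf(term, texts):
    """log(number of texts / number of texts containing term); 0.0 if none contain it."""
    matches = sum(1 for text in texts if term in text)
    return math.log(len(texts) / matches) if matches else 0.0

def tf(term, text):
    """Occurrences of term in the token list divided by the list length."""
    return text.count(term) / len(text)

# Hypothetical tokenized corpus (lowercased tokens), for illustration only.
corpus = [
    ["commande", "livraison", "client"],
    ["commande", "facture"],
    ["commande", "retour", "client"],
]

# Case 1: the term is in no text (here because of a case mismatch) -> 0.0
print(idf("Commande", corpus))
# Case 2: the term is in every text -> log(3 / 3) = 0.0
print(idf("commande", corpus))
# A term found in only one of the three texts gets a positive idf: log(3 / 1)
print(idf("facture", corpus))
```

Under these assumptions, a tf_idf of 0.0 with a non-zero tf means the idf factor is 0.0, so it may be worth checking whether the source tokens match the corpus tokens exactly (case, accents, stemming) and whether the word simply occurs in every corpus document.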
[Discussion]:
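- Could you post the full code, or a link to it? From the snippet you posted, it is not clear where the problem might be. If possible, also post a sample of your corpus somewhere; without it the cause is hard to pin down.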
Tags: python python-3.x nlp nltk