【Posted】: 2019-07-31 12:56:44
【Question】:
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["The car is driven on the road",
             "The truck is driven on the highway",
             "the lorry is"]
tfidf_transformer = TfidfVectorizer(smooth_idf=True, use_idf=True)
tfidf = tfidf_transformer.fit_transform(documents)
print(tfidf_transformer.vocabulary_)
print(tfidf.toarray())
Output:
{'the': 7, 'car': 0, 'on': 5, 'driven': 1, 'is': 3, 'road': 6, 'lorry': 4, 'truck': 8, 'highway': 2}
[[0.45171082 0.34353772 0. 0.26678769 0. 0.34353772 0.45171082 0.53357537 0. ]
[0. 0.34353772 0.45171082 0.26678769 0. 0.34353772 0. 0.53357537 0.45171082]
[0. 0. 0. 0.45329466 0.76749457 0. 0. 0.45329466 0. ]]
The word "the" should have a low score in all three documents.
【Discussion】:
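One possible explanation, sketched by hand rather than taken from the library source: with smooth_idf=True the IDF is ln((1 + n) / (1 + df)) + 1, so a word that occurs in every document still gets an IDF of 1, not 0. "the" does have the lowest IDF here, but it occurs twice in each of the first two documents, so its count times IDF (2 x 1 = 2) is larger than that of a rare word such as "car" (1 x 1.69). The snippet below recomputes the scores using only the standard library. It assumes the defaults of lowercasing, tokens of two or more word characters, raw term counts, and L2 normalization per document; the helper name tfidf_row is made up for this illustration.

```python
import math
import re
from collections import Counter

documents = [
    "The car is driven on the road",
    "The truck is driven on the highway",
    "the lorry is",
]

# Lowercase and keep tokens of two or more word characters
# (assumed to mirror the vectorizer's default tokenization).
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
n_docs = len(tokenized)

# Document frequency: in how many documents each term occurs.
df = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1, which never drops below 1.
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
       for term, count in df.items()}


def tfidf_row(tokens):
    """Raw count times IDF, then L2-normalized within the document."""
    counts = Counter(tokens)
    raw = {term: counts[term] * idf[term] for term in counts}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}


rows = [tfidf_row(tokens) for tokens in tokenized]

print("idf:", {term: round(value, 4) for term, value in sorted(idf.items())})
for row in rows:
    print({term: round(value, 4) for term, value in sorted(row.items())})
```

In the third document, where "the" occurs only once, it ties with "is" for the lowest score, which matches the posted array. If the goal is to suppress such words entirely, TfidfVectorizer accepts stop_words="english" to drop common English words, and sublinear_tf=True to replace the raw count with 1 + ln(count), which reduces the effect of repeated words.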