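【Posted】:2015-02-22 06:29:33
【Question】:
I have a directory full of .txt files (documents). First I load the files and strip some brackets and quotation marks, so the documents look like this, for example: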
document1:
is a scientific discipline that explores the construction and study of algorithms that can learn from data Such algorithms operate by building a model
document2:
Machine learning can be considered a subfield of computer science and statistics It has strong ties to artificial intelligence and optimization which deliver methods
So I load the files from the directory like this:
preprocessDocuments =[[' '.join(x) for x in sample[:-1]] for sample in load(directory)]
documents = ''.join( i for i in ''.join(str(v) for v
in preprocessDocuments) if i not in "',()")
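To see what the expression above produces, here is a minimal sketch. The `samples` list is a hypothetical stand-in for `load(directory)`, whose real return value is not shown in the question; the assumption is that each sample is a list of token lists followed by one trailing element that `sample[:-1]` drops.

```python
# Hypothetical stand-in for load(directory); the real structure is an assumption.
samples = [
    [["Machine", "learning"], ["can", "learn"], "label1"],
    [["It", "has"], ["strong", "ties"], "label2"],
]

# Same two statements as in the question, applied to the stand-in data.
preprocessDocuments = [[' '.join(x) for x in sample[:-1]] for sample in samples]
documents = ''.join(i for i in ''.join(str(v) for v
                    in preprocessDocuments) if i not in "',()")

# The result is one flat string, not a list with one entry per document.
print(type(documents).__name__)
print(documents)
```

Under these assumptions `documents` ends up as a single `str` in which the document boundaries survive only as leftover square brackets, so whatever receives it no longer sees separate documents.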
Then I try to vectorize document1 and document2 to create a training matrix, like this:
from sklearn.feature_extraction.text import HashingVectorizer
vectorizer = HashingVectorizer(analyzer='word')
X = HashingVectorizer.fit_transform(documents)
X.toarray()
This is the output:
raise ValueError("empty vocabulary; perhaps the documents only"
ValueError: empty vocabulary; perhaps the documents only contain stop words
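Given this, how can I create the vector representation? I thought documents held the loaded files, but it seems the vectorizer cannot be fitted on them.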
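For reference, a minimal sketch of the intended result, assuming each document is kept as a separate string in a list and that one row per document is wanted. The two strings are the example documents quoted above; `n_features=2**10` is an illustrative choice made here (not part of the original code) to keep the dense array small.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# One string per document (the two example texts from the question).
documents = [
    "is a scientific discipline that explores the construction and study of "
    "algorithms that can learn from data Such algorithms operate by building a model",
    "Machine learning can be considered a subfield of computer science and "
    "statistics It has strong ties to artificial intelligence and optimization "
    "which deliver methods",
]

# n_features is an assumption for this sketch; the default is much larger.
vectorizer = HashingVectorizer(analyzer='word', n_features=2**10)

# Call fit_transform on the vectorizer instance rather than on the class.
X = vectorizer.fit_transform(documents)

# Dense view: one row per document, one column per hashed feature.
dense = X.toarray()
print(dense.shape)
```

This is only an illustration of the expected input shape (an iterable of strings), not a diagnosis of what `load(directory)` actually returns.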
【Comments】:
Tags: python machine-learning nlp scikit-learn