[Posted]: 2015-02-19 00:06:04
[Question]:
Suppose I have a number of different .txt files in a folder on my desktop. They look like this.
File_1:
('this', 'is'), ('a', 'very'),....., ('large', '.txt'), ('file', 'with'), ('lots', 'of'), ('words', 'like'), ('this', 'i'), ('would', 'like'), ('to', 'create'), ('a', 'matrix'),'LABEL_1'
...
File_N:
('this', 'is'), ('a', 'another'),....., ('large', '.txt'), ('file', 'with'), ('lots', 'of'), ('words', 'like'), ('this', 'i'), ('would', 'like'), ('to', 'create'), ('a', 'matrix'),'LABEL_N'
From the documentation, scikit-learn provides load_files, and I can vectorize with the hashing trick as follows:
from sklearn.feature_extraction.text import FeatureHasher
from sklearn.svm import SVC

training_data = [[('string1', 'string2'), ('string3', 'string4'),
                  ('string5', 'string6'), 'POS'],
                 [('string1', 'string2'), ('string3', 'string4'), 'NEG']]

feature_hasher_vect = FeatureHasher(input_type='string')
X = feature_hasher_vect.transform((' '.join(x) for x in sample)
                                  for sample in training_data)
print X.toarray()
Output:
[[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]]
How can I vectorize an entire folder of .txt files (applying the same process as above) using load_files() or any other method?
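One caveat: sklearn's load_files expects one subdirectory per category, whereas here each label is stored inside its own file. Below is a minimal sketch of an alternative, assuming each .txt file holds comma-separated word pairs followed by a single label string exactly as shown above. The folder layout, file contents, and the load_folder helper are all hypothetical, invented for illustration; the joining scheme (one token per word pair) mirrors the snippet in the question.

```python
import ast
import glob
import os
import tempfile

from sklearn.feature_extraction.text import FeatureHasher

# Hypothetical folder with files shaped like those in the question.
tmpdir = tempfile.mkdtemp()
samples = [
    "('this', 'is'), ('a', 'very'), ('large', 'file'), 'LABEL_1'",
    "('this', 'is'), ('a', 'another'), ('large', 'file'), 'LABEL_2'",
]
for i, text in enumerate(samples):
    with open(os.path.join(tmpdir, 'file_%d.txt' % i), 'w') as f:
        f.write(text)

def load_folder(path):
    """Parse every .txt file in `path` into (word pairs, label)."""
    data, labels = [], []
    for fname in sorted(glob.glob(os.path.join(path, '*.txt'))):
        with open(fname) as f:
            # Wrap in brackets so the line parses as a Python list literal.
            items = ast.literal_eval('[' + f.read() + ']')
        labels.append(items[-1])   # last element is the label
        data.append(items[:-1])    # the rest are the word pairs
    return data, labels

data, labels = load_folder(tmpdir)
hasher = FeatureHasher(input_type='string')
# Same joining scheme as in the question: one token per word pair.
X = hasher.transform((' '.join(pair) for pair in sample) for sample in data)
print(X.shape)  # (n_files, n_features); n_features defaults to 2**20
```

The labels list can then be passed alongside X to a classifier such as SVC.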
[Discussion]:
Tags: python python-2.7 machine-learning scikit-learn nltk