Posted: 2014-03-02 14:06:51
Problem description:
I'm using scikit-learn for text classification. Using a single feature works fine, but introducing multiple features gives me an error. I think the problem is that I'm not formatting the data the way the classifier expects.
For example, this works fine:
data = np.array(df['feature1'])
classes = label_encoder.transform(np.asarray(df['target']))
X_train, X_test, Y_train, Y_test = train_test_split(data, classes)
classifier = Pipeline(...)
classifier.fit(X_train, Y_train)
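For reference, the single-feature case can be reproduced end to end with a small runnable sketch. The actual Pipeline(...) is elided in the question, so a CountVectorizer + MultinomialNB pipeline and a toy DataFrame are assumed here purely for illustration:

```python
# Minimal runnable sketch of the working single-feature case.
# The toy DataFrame and the pipeline steps are assumptions, not the
# question's actual data or Pipeline(...).
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'feature1': ['short text one', 'short text two',
                 'short text three', 'short text four'],
    'target':   ['granted', 'denied', 'granted', 'denied'],
})

label_encoder = LabelEncoder()
data = np.array(df['feature1'])                      # 1-D array of strings
classes = label_encoder.fit_transform(np.asarray(df['target']))

X_train, X_test, Y_train, Y_test = train_test_split(data, classes,
                                                    random_state=0)
classifier = Pipeline([('vect', CountVectorizer()),
                       ('clf', MultinomialNB())])
classifier.fit(X_train, Y_train)   # works: each sample is a single string
```

The key point is that `data` here is one-dimensional, so the vectorizer inside the pipeline receives one string per sample.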
But this:
data = np.array(df[['feature1', 'feature2']])
classes = label_encoder.transform(np.asarray(df['target']))
X_train, X_test, Y_train, Y_test = train_test_split(data, classes)
classifier = Pipeline(...)
classifier.fit(X_train, Y_train)
dies with:
Traceback (most recent call last):
File "/Users/jed/Dropbox/LegalMetric/LegalMetricML/motion_classifier.py", line 157, in <module>
classifier.fit(X_train, Y_train)
File "/Library/Python/2.7/site-packages/sklearn/pipeline.py", line 130, in fit
Xt, fit_params = self._pre_transform(X, y, **fit_params)
File "/Library/Python/2.7/site-packages/sklearn/pipeline.py", line 120, in _pre_transform
Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 780, in fit_transform
vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 715, in _count_vocab
for feature in analyze(doc):
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 229, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 195, in <lambda>
return lambda x: strip_accents(x.lower())
AttributeError: 'numpy.ndarray' object has no attribute 'lower'
during the preprocessing stage after calling classifier.fit(). I think the problem is how I'm formatting the data, but I don't know how to do it correctly.
feature1 and feature2 are both English text strings, as is the target. I'm using LabelEncoder() to encode the target, which seems to work fine.
Here's a sample of what print data returns, to give you an idea of its current format:
[['some short english text'
'a paragraph of english text']
['some more short english text'
'a second paragraph of english text']
['some more short english text'
'a third paragraph of english text']]
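The traceback comes from CountVectorizer's analyzer: it treats each element of its input as one document and calls .lower() on it, so with a 2-D array each "document" is a whole row (an ndarray), not a string. A short sketch reproducing the failure, plus one common workaround (joining the text columns into a single string per sample; the example strings are illustrative):

```python
# Reproduce the failure: a 2-D array makes each "document" an ndarray row.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

data_2d = np.array([
    ['some short english text', 'a paragraph of english text'],
    ['some more short english text', 'a second paragraph of english text'],
])

try:
    CountVectorizer().fit_transform(data_2d)
    failed = False
except AttributeError:
    failed = True  # reproduces the AttributeError from the traceback above

# One common workaround: join the text columns into one string per row,
# so the vectorizer again sees a 1-D array of strings.
joined = np.array([' '.join(row) for row in data_2d])
X = CountVectorizer().fit_transform(joined)  # sparse matrix, one row per sample
```

Joining the columns loses the distinction between the two features; if they should stay separate, each column needs its own vectorizer.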
Comments:
- So how are you formatting the data? I usually find I can pass a pandas DataFrame directly to scikit functions and it works fine.
- I tried passing the DataFrame directly to train_test_split(), but I got the same error. train_test_split(df['feature1'], label_encoder.transform(df['target'])) works fine; train_test_split(df[['feature1', 'feature2']], label_encoder.transform(df['matches'])) does not.
- Can you print out what X_train looks like in both cases?
- With two features, X_train looks like the print data sample in the question (not literally identical, of course, since it's split). With one feature, X_train looks like this: ['short english text' 'additional english text' 'more short english text' ..., 'still more short english text' 'yet more short english text' 'english text']. So with two features it's an array of arrays of strings, and with one feature it's an array of strings. Presumably that's the problem, but I don't know what fit() expects it to look like.
- Looking at the docs, I see that fit() expects {array-like, sparse matrix}, shape = [n_samples, n_features]. Printing X_train.shape with two features gives (4630, 2); with one feature it's (4630,). So that seems right; not sure what I'm missing.
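The shape = [n_samples, n_features] in the docs refers to numeric features; raw text first has to pass through a vectorizer, which expects a 1-D array of strings. To keep feature1 and feature2 as separate features, one option is to give each column its own CountVectorizer and combine them with a FeatureUnion. A sketch (the selector helper and step names are illustrative, not from the question):

```python
# Vectorize two text columns separately and stack the results side by side.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

data = np.array([
    ['some short english text', 'a paragraph of english text'],
    ['some more short english text', 'a second paragraph of english text'],
])

def pick_column(i):
    # Select column i of the 2-D array, yielding a 1-D array of strings.
    return FunctionTransformer(lambda X: X[:, i], validate=False)

union = FeatureUnion([
    ('feature1', Pipeline([('pick', pick_column(0)),
                           ('vect', CountVectorizer())])),
    ('feature2', Pipeline([('pick', pick_column(1)),
                           ('vect', CountVectorizer())])),
])

# Sparse matrix: one row per sample, both vocabularies concatenated.
X = union.fit_transform(data)
```

In newer scikit-learn versions, a ColumnTransformer applied directly to the DataFrame columns achieves the same result more directly.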
Tags: python pandas machine-learning scikit-learn