【Posted】: 2019-12-07 20:56:09
【Question】:
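I am doing some NLP classification and I would like to build a stacking ensemble.
My original data contains descriptions at several levels of detail for each class. For example, a single instance may initially have one column with its name, one with a short description, one with a description of its sub-category, and so on.
X_train in the code below holds, in each column, all the words for one level of granularity. For example, the first column may be the short description, the second the sub-category description plus words from another source, and the third more words from a finer-grained category.
I wrapped pipe_1 and pipe_2 in a StackingClassifier because that is what I ultimately want to do, but I get the same error if I run a single pipeline stand-alone (fitting pipe_1 directly).
I tried changing the format of X_train and y_train (using ravel() and .tolist()), but I suspect the format problem arises when the pipeline applies ColumnSelector, and I am not sure how to handle that.
X_train (<class 'pandas.core.frame.DataFrame'>) and y_train (<class 'pandas.core.series.Series'>) have the same types as in my successful non-stacked runs. In those successful runs, what gets passed to the fit method is a <class 'scipy.sparse.csr.csr_matrix'>. I assumed the same would happen in the stacked example, since I expected the TfidfVectorizer to take care of it. The main difference I see is that X_train has more than one column in the stacked case, which I think may be what creates the problematic numpy.ndarray for each row. But I expected the ColumnSelector in make_pipeline to "take care of this".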
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from mlxtend.feature_selection import ColumnSelector
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from mlxtend.classifier import StackingClassifier
# creating my toy trainset and testset
start = [
    ['apple this is painful Two wrongs make a right ok',
     'just a batch of suspicious words and banana',
     'another batch of fake words and another apple'],
    ['Fortune favors the italic sunny sunshine',
     'name of a company and then its description',
     'is it all sunshine or doomed to fail to no sunshine'],
    ['this was it when in rome do as the romans and make fortune',
     'well again the same thing and those descriptions',
     'lets make that work and bring the fortune'],
    ['Ok this is the last one and then its the end',
     'is it the beggining of the end or the end of the beggining',
     'allelouia']
]
X_train = pd.DataFrame(
    start, columns=['High_level', 'Mid_level', 'Low_level'])
y_train = ['A', 'B', 'C', 'D']
X_test = pd.DataFrame([['mostly apple'], ['bunch of apple'],
                       ['lot of fortune'], ['make fortune and bring the'],
                       ['beggining of the end']])
y_true = ['A', 'A', 'C', 'C', 'D']
The error is raised by the final line of the following code (the call to sclf.fit):
pipe_1 = make_pipeline(ColumnSelector(cols=(1,)), TfidfVectorizer(min_df=1),
                       LogisticRegression(multi_class='multinomial'))
pipe_2 = make_pipeline(ColumnSelector(cols=(2,)), TfidfVectorizer(min_df=1),
                       LogisticRegression(multi_class='multinomial'))
sclf = StackingClassifier(
    classifiers=[pipe_1, pipe_2],
    meta_classifier=LogisticRegression(
        solver='lbfgs', multi_class='multinomial',
        C=1.0, class_weight='balanced', tol=1e-6, max_iter=1000,
        n_jobs=-1))
predictions = sclf.fit(X_train, y_train).predict(X_test)
Here is the full traceback:
Traceback (most recent call last):
File "C:/Users/inf10926/PycharmProjects/profiling/venv/lab.py", line 52, in <module>
predictions = sclf.fit(X_train, y_train).predict(X_test)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\mlxtend\classifier\stacking_classification.py", line 161, in fit
clf.fit(X, y)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 352, in fit
Xt, fit_params = self._fit(X, y, **fit_params)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 317, in _fit
**fit_params_steps[name])
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\joblib\memory.py", line 355, in __call__
return self.func(*args, **kwargs)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 716, in _fit_transform_one
res = transformer.fit_transform(X, y, **fit_params)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 1652, in fit_transform
X = super().fit_transform(raw_documents)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 1058, in fit_transform
self.fixed_vocabulary_)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 970, in _count_vocab
for feature in analyze(doc):
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 352, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 256, in <lambda>
return lambda x: strip_accents(x.lower())
AttributeError: 'numpy.ndarray' object has no attribute 'lower'
Process finished with exit code 1
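The last frame of this traceback can be reproduced without scikit-learn. The sketch below is a minimal illustration, assuming the ColumnSelector step hands the vectorizer a two-dimensional array of shape (n_samples, 1). That shape is inferred from the error message and is not confirmed anywhere in the post. The toy values and variable names are illustrative only.

```python
import numpy as np

# Toy stand-in for the DataFrame's underlying values: 2 rows x 3 text columns.
values = np.array(
    [["apple pie", "banana split", "cherry tart"],
     ["sunny day", "rainy night", "windy dawn"]],
    dtype=object,
)

# Selecting with a list of column indices keeps two dimensions: shape (2, 1).
selected_2d = values[:, [1]]

# Iterating over a 2-D array yields one ndarray per row, not one string per row.
first_doc = next(iter(selected_2d))

try:
    first_doc.lower()  # what a lowercasing preprocessor would attempt
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(selected_2d.shape, type(first_doc).__name__)
print(error_message)
```

If that assumption holds, each "document" the vectorizer sees is a one-element array rather than a str, which would explain the missing lower attribute.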
If I set lowercase=False in the TfidfVectorizer, I get a different error:
Traceback (most recent call last):
File "C:/Users/inf10926/PycharmProjects/profiling/venv/lab.py", line 52, in <module>
predictions = sclf.fit(X_train, y_train).predict(X_test)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\mlxtend\classifier\stacking_classification.py", line 161, in fit
clf.fit(X, y)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 352, in fit
Xt, fit_params = self._fit(X, y, **fit_params)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 317, in _fit
**fit_params_steps[name])
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\joblib\memory.py", line 355, in __call__
return self.func(*args, **kwargs)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 716, in _fit_transform_one
res = transformer.fit_transform(X, y, **fit_params)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 1652, in fit_transform
X = super().fit_transform(raw_documents)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 1058, in fit_transform
self.fixed_vocabulary_)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 970, in _count_vocab
for feature in analyze(doc):
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 352, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 265, in <lambda>
return lambda doc: token_pattern.findall(doc)
TypeError: cannot use a string pattern on a bytes-like object
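This second error looks like the same problem surfacing one step later: with lowercasing disabled, the per-row array presumably reaches the tokenizer's regular expression instead. A minimal sketch, assuming the document is again a one-element ndarray. The pattern is the default token_pattern documented for TfidfVectorizer, and the variable names are illustrative:

```python
import re

import numpy as np

# Default token_pattern documented for scikit-learn's text vectorizers.
token_pattern = re.compile(r"(?u)\b\w\w+\b")

# One "document" as it would look if a row of a (n_samples, 1) array were passed.
row = np.array(["just a batch of suspicious words and banana"], dtype=object)

try:
    token_pattern.findall(row)  # an ndarray is not a str
    raised = None
except TypeError as exc:
    # The exact message may differ between Python and NumPy versions.
    raised = exc

# Passing the string itself tokenizes as expected.
tokens = token_pattern.findall(row[0])
print(type(raised).__name__, tokens)
```

Both tracebacks therefore appear to point at the shape of the selected column rather than at the lowercase option.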
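【Comments】:
- Looks like it's expecting a string
- @Mad Phycicist: thanks for stopping by. Yes, I guess that with a single column it is normally a list of words that gets "automatically converted into one huge string", or something like that. But having two columns may turn that list into a numpy.ndarray or something similar. I don't know what the workaround would be.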
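One possible workaround, sketched under assumptions rather than taken from the thread, is to make sure the vectorizer receives a one-dimensional sequence of strings. The example below uses scikit-learn's FunctionTransformer with a hypothetical helper, mid_level_as_1d, in place of the ColumnSelector step, and a LogisticRegression with mostly default settings rather than the options used in the post. mlxtend's ColumnSelector also documents a drop_axis parameter intended for the single-column case, which may be worth checking against the installed version. It is not exercised here.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# Two of the text columns from the toy training set in the question.
X_train = pd.DataFrame({
    "Mid_level": [
        "just a batch of suspicious words and banana",
        "name of a company and then its description",
        "well again the same thing and those descriptions",
        "is it the beggining of the end or the end of the beggining",
    ],
    "Low_level": [
        "another batch of fake words and another apple",
        "is it all sunshine or doomed to fail to no sunshine",
        "lets make that work and bring the fortune",
        "allelouia",
    ],
})
y_train = ["A", "B", "C", "D"]


def mid_level_as_1d(frame):
    """Hypothetical helper: return the 'Mid_level' column as a 1-D array of strings."""
    return frame["Mid_level"].to_numpy()


pipe = make_pipeline(
    # 1-D output, so each document handed to the vectorizer is a plain str.
    FunctionTransformer(mid_level_as_1d, validate=False),
    TfidfVectorizer(min_df=1),
    LogisticRegression(max_iter=1000),  # not the post's multi_class options
)
pipe.fit(X_train, y_train)

# All steps except the classifier: a sparse matrix with one row per document.
tfidf_matrix = pipe[:-1].transform(X_train)
predictions = pipe.predict(X_train)
print(tfidf_matrix.shape[0], list(predictions))
```

Whether the same change is enough inside StackingClassifier is not verified here. Note also that X_test in the post has a single column, so selecting columns 1 or 2 from it would likely fail at predict time.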
Tags: python-3.x scikit-learn pipeline attributeerror tfidfvectorizer