AttributeError: 'numpy.ndarray' object has no attribute 'lower' in pipeline

【Question title】: AttributeError: 'numpy.ndarray' object has no attribute 'lower' in pipeline
【Posted】: 2019-12-07 20:56:09
【Question description】:

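I'm doing some NLP classification and I want to build a stacking ensemble.

My raw data contains descriptions of each class at different levels of detail. For example, for a single instance we may have one column with its name, one column with a short description, one column with a description of its subcategory, and so on.

X_train in the code below has one column per level of granularity, each holding all the words for that level. For example, the first column may be the short description, the second column the subcategory description plus words from another source, and the third column more words from a finer-grained category.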

I'm wrapping the pipe_1 and pipe_2 workflows in a StackingClassifier because that is what I ultimately want to do, but I get the same error if I run a single pipeline on its own (fitting pipe_1 directly).

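I tried changing the format of X_train and y_train (using ravel() and .tolist()), but I suspect the format problem arises when the pipeline applies ColumnSelector, and I'm not sure how to handle that.

The types of X_train (<class 'pandas.core.frame.DataFrame'>) and y_train (<class 'pandas.core.series.Series'>) are the same as in my successful non-stacked runs. In a successful run, what gets passed to the classifier's fit method is a <class 'scipy.sparse.csr.csr_matrix'>. I assume the same should hold in the stacked example, and I expected TfidfVectorizer to produce it. The main difference I can see is that in the stacked case X_train has more than one column, which I think may be what creates the problematic numpy.ndarray for each row. But I had expected the ColumnSelector in make_pipeline to take care of that.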

import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from mlxtend.feature_selection import ColumnSelector
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from mlxtend.classifier import StackingClassifier

# creating my toy trainset and testset
start = [
    ['apple this is painful Two wrongs make a right ok',
     'just a batch of suspicious words and banana',
     'another batch of fake words and another apple'],
    ['Fortune favors the italic sunny sunshine',
     'name of a company and then its description',
     'is it all sunshine or doomed to fail to no sunshine'],
    ['this was it when in rome do as the romans and make fortune',
     'well again the same thing and those descriptions',
     'lets make that work and bring the fortune'],
    ['Ok this is the last one and then its the end',
     'is it the beggining of the end or the end of the beggining',
     'allelouia']
]

X_train = pd.DataFrame(
    start, columns=['High_level', 'Mid_level', 'Low_level'])
y_train = ['A', 'B', 'C', 'D']
X_test = pd.DataFrame([['mostly apple'], ['bunch of apple'],
                       ['lot of fortune'], ['make fortune and bring the'],
                       ['beggining of the end']])
y_true = ['A', 'A', 'C', 'C', 'D']

The error is raised when running the following code:

pipe_1 = make_pipeline(ColumnSelector(cols=(1,)), TfidfVectorizer(min_df=1),
                     LogisticRegression(multi_class='multinomial'))
pipe_2 = make_pipeline(ColumnSelector(cols=(2,)), TfidfVectorizer(min_df=1),
                     LogisticRegression(multi_class='multinomial'))
sclf = StackingClassifier(
        classifiers=[pipe_1, pipe_2],
        meta_classifier=LogisticRegression(
            solver='lbfgs', multi_class='multinomial',
            C=1.0, class_weight='balanced', tol=1e-6, max_iter=1000,
            n_jobs=-1))
predictions = sclf.fit(X_train, y_train).predict(X_test)

Here is the full error:

Traceback (most recent call last):
  File "C:/Users/inf10926/PycharmProjects/profiling/venv/lab.py", line 52, in <module>
    predictions = sclf.fit(X_train, y_train).predict(X_test)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\mlxtend\classifier\stacking_classification.py", line 161, in fit
    clf.fit(X, y)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 352, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 317, in _fit
    **fit_params_steps[name])
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\joblib\memory.py", line 355, in __call__
    return self.func(*args, **kwargs)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 716, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 1652, in fit_transform
    X = super().fit_transform(raw_documents)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 1058, in fit_transform
    self.fixed_vocabulary_)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 970, in _count_vocab
    for feature in analyze(doc):
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 352, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 256, in <lambda>
    return lambda x: strip_accents(x.lower())
AttributeError: 'numpy.ndarray' object has no attribute 'lower'

Process finished with exit code 1

If I set lowercase=False in TfidfVectorizer, I get a different error:

Traceback (most recent call last):
  File "C:/Users/inf10926/PycharmProjects/profiling/venv/lab.py", line 52, in <module>
    predictions = sclf.fit(X_train, y_train).predict(X_test)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\mlxtend\classifier\stacking_classification.py", line 161, in fit
    clf.fit(X, y)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 352, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 317, in _fit
    **fit_params_steps[name])
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\joblib\memory.py", line 355, in __call__
    return self.func(*args, **kwargs)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 716, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 1652, in fit_transform
    X = super().fit_transform(raw_documents)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 1058, in fit_transform
    self.fixed_vocabulary_)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 970, in _count_vocab
    for feature in analyze(doc):
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 352, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 265, in <lambda>
    return lambda doc: token_pattern.findall(doc)
TypeError: cannot use a string pattern on a bytes-like object

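【Question discussion】:

It looks like it's expecting a string.
@Mad Physicist: Thanks for stopping by. Yes, I guess that normally, with a single column, it's a list of words that gets "automatically converted into one giant string", or something like that. But the multi-column setup may turn that into a list or a numpy.ndarray or whatever. I don't know what the workaround would be.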

Tags: python-3.x scikit-learn pipeline attributeerror tfidfvectorizer

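【Solution 1】:

I ran into the same problem. I solved it by adding drop_axis=True to ColumnSelector. This argument is needed when selecting only one column: without it, ColumnSelector returns a 2-D array of shape (n_samples, 1), so TfidfVectorizer receives each row as a numpy.ndarray rather than a string.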
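To illustrate why the extra axis matters, here is a minimal sketch that uses only NumPy and scikit-learn, with no mlxtend involved. The toy documents and the variable names (docs, two_d, one_d) are made up for illustration. The (n, 1) array stands in for what a single-column selection looks like when the axis is kept, and the reshape stands in for what dropping the axis does.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents, one per sample.
docs = [
    "just a batch of suspicious words and banana",
    "name of a company and then its description",
    "well again the same thing and those descriptions",
]

# Shape (3, 1): a single selected column with the column axis kept.
# Iterating over it yields one-element arrays, not strings.
two_d = np.array(docs, dtype=object).reshape(-1, 1)

# Shape (3,): the same data with the last axis dropped.
# Iterating over it yields plain Python strings.
one_d = two_d.reshape(-1)

try:
    TfidfVectorizer().fit_transform(two_d)
    error_name = None
except AttributeError as exc:
    # Each "document" is an ndarray, which has no .lower() method.
    error_name = type(exc).__name__

# The 1-D version vectorizes normally into a sparse matrix with one row per document.
features = TfidfVectorizer().fit_transform(one_d)
print(two_d.shape, one_d.shape, error_name, features.shape[0])
```

The vectorizer iterates over its input and treats every item as one document, so the input has to be a 1-D sequence of strings.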

See the API documentation here: http://rasbt.github.io/mlxtend/user_guide/feature_selection/ColumnSelector/#api

【Discussion】: