【Question title】: sklearn ColumnTransformer with MultilabelBinarizer
【Posted】: 2020-04-02 21:33:04
【Question】:

I would like to know whether MultiLabelBinarizer can be used inside a ColumnTransformer.

I have a toy pandas DataFrame, for example:

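import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({
    "id": [1, 2, 3],
    "text": ["some text", "some other text", "yet another text"],
    "label": [["white", "cat"], ["black", "cat"], ["brown", "dog"]],
})

preprocess = ColumnTransformer(
    [
        ("vectorizer", CountVectorizer(), "text"),
        ("binarizer", MultiLabelBinarizer(), ["label"]),
    ],
    remainder="drop",
)
preprocess.fit_transform(df)  # this call triggers the exception below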

However, this code raises an exception:

~/lib/python3.7/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
    714     with _print_elapsed_time(message_clsname, message):
    715         if hasattr(transformer, 'fit_transform'):
--> 716             res = transformer.fit_transform(X, y, **fit_params)
    717         else:
    718             res = transformer.fit(X, y, **fit_params).transform(X)

TypeError: fit_transform() takes 2 positional arguments but 3 were given

With OneHotEncoder in place of MultiLabelBinarizer, the ColumnTransformer does work.

【Comments】:

    Tags: python python-3.x scikit-learn pipeline


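    【Answer 1】:

    I have not been especially diligent in testing exactly why the following works, but I was able to build a custom transformer that essentially "wraps" MultiLabelBinarizer while remaining compatible with ColumnTransformer: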

    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.preprocessing import MultiLabelBinarizer

    class MultiLabelBinarizerFixedTransformer(BaseEstimator, TransformerMixin):
        """
        Wraps `MultiLabelBinarizer` in a form that can work with `ColumnTransformer`
        """
        def __init__(self):
            self.feature_name = ["mlb"]
            self.mlb = MultiLabelBinarizer(sparse_output=False)

        def fit(self, X, y=None):
            # y is accepted (and ignored) so the signature matches what
            # ColumnTransformer calls. X should be a single column of label
            # sequences, e.g. a pandas Series.
            self.mlb.fit(X)
            return self

        def transform(self, X):
            return self.mlb.transform(X)

        def get_feature_names(self, input_features=None):
            cats = self.mlb.classes_
            if input_features is None:
                input_features = ['x%d' % i for i in range(len(cats))]
            elif len(input_features) != len(cats):
                # Compare against the fitted classes; this wrapper has no
                # `categories_` attribute.
                raise ValueError(
                    "input_features should have length equal to number of "
                    "features ({}), got {}".format(len(cats),
                                                   len(input_features)))

            feature_names = [f"{input_features[i]}_{cats[i]}" for i in range(len(cats))]
            return np.array(feature_names, dtype=object)
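A usage sketch, assuming the wrapper is applied to the toy frame from the question. `LabelColumnBinarizer` is a hypothetical, stripped-down stand-in for the class above, included only so the example runs on its own. It also assumes the column is selected with a scalar string (`"label"`) rather than a list, so that ColumnTransformer hands the transformer a 1-D Series of label lists.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer


class LabelColumnBinarizer(BaseEstimator, TransformerMixin):
    """Hypothetical minimal wrapper: fit/transform accept (X, y=None)."""

    def fit(self, X, y=None):
        # Fit a fresh binarizer on the single column of label sequences.
        self.mlb_ = MultiLabelBinarizer()
        self.mlb_.fit(X)
        return self

    def transform(self, X):
        return self.mlb_.transform(X)


df = pd.DataFrame({
    "id": [1, 2, 3],
    "text": ["some text", "some other text", "yet another text"],
    "label": [["white", "cat"], ["black", "cat"], ["brown", "dog"]],
})

preprocess = ColumnTransformer(
    [
        ("vectorizer", CountVectorizer(), "text"),
        # Scalar column name: the transformer receives a 1-D Series.
        ("binarizer", LabelColumnBinarizer(), "label"),
    ],
    remainder="drop",
)
features = preprocess.fit_transform(df)
print(features.shape)
```

The result has one column per vocabulary token from `text` followed by one column per distinct label.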
    

    My hunch is that MultiLabelBinarizer's transform() takes a different set of inputs than ColumnTransformer expects.
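The traceback in the question points the same way: the pipeline calls `fit_transform(X, y)`, while `MultiLabelBinarizer.fit_transform` accepts only the label sequences. A small check of that mismatch, offered as a sketch rather than a full diagnosis:

```python
from sklearn.preprocessing import MultiLabelBinarizer

labels = [["white", "cat"], ["black", "cat"], ["brown", "dog"]]
mlb = MultiLabelBinarizer()

# One positional argument (the label sequences) works.
encoded = mlb.fit_transform(labels)

# Two positional arguments, as a pipeline would pass (X, y), do not.
try:
    mlb.fit_transform(labels, None)
    raised_type_error = False
except TypeError as exc:
    raised_type_error = True
    print(exc)
```

A wrapper whose `fit` accepts `(X, y=None)` sidesteps this, because `TransformerMixin` then supplies a compatible `fit_transform`.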

    【Comments】:

      【Answer 2】:

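      For an input X, MultiLabelBinarizer is suited to handling one column at a time (since each row is expected to be a sequence of categories), whereas OneHotEncoder can handle multiple columns. To build a MultiHotEncoder that is compatible with ColumnTransformer, you need to iterate over all columns of X and fit/transform each column with its own MultiLabelBinarizer. The following should work with pandas.DataFrame input.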

      import numpy as np
      import pandas as pd
      from sklearn.base import BaseEstimator, TransformerMixin
      from sklearn.compose import ColumnTransformer
      from sklearn.preprocessing import MultiLabelBinarizer, OneHotEncoder
      
      class MultiHotEncoder(BaseEstimator, TransformerMixin):
          """Wraps `MultiLabelBinarizer` in a form that can work with `ColumnTransformer`. Note
          that input X has to be a `pandas.DataFrame`.
          """
          def __init__(self):
              self.mlbs = list()
              self.n_columns = 0
              self.categories_ = self.classes_ = list()
      
          def fit(self, X:pd.DataFrame, y=None):
              # Reset state so that refitting the same instance does not
              # accumulate binarizers from a previous fit.
              self.mlbs = list()
              self.n_columns = 0
              self.categories_ = self.classes_ = list()
              for i in range(X.shape[1]): # X can be of multiple columns
                  mlb = MultiLabelBinarizer()
                  mlb.fit(X.iloc[:,i])
                  self.mlbs.append(mlb)
                  self.classes_.append(mlb.classes_)
                  self.n_columns += 1
              return self
      
          def transform(self, X:pd.DataFrame):
              if self.n_columns == 0:
                  raise ValueError('Please fit the transformer first.')
              if self.n_columns != X.shape[1]:
                  raise ValueError(f'The fit transformer deals with {self.n_columns} columns '
                                   f'while the input has {X.shape[1]}.'
                                  )
              result = list()
              for i in range(self.n_columns):
                  # Each column is encoded by the binarizer fitted on it.
                  result.append(self.mlbs[i].transform(X.iloc[:,i]))
      
              result = np.concatenate(result, axis=1)
              return result
      
      # test
      temp = pd.DataFrame({
          "id":[1,2,3], 
          "text": ["some text", "some other text", "yet another text"], 
          "label": [["white", "cat"], ["black", "cat"], ["brown", "dog"]],
          "label2": [["w", "c"], ["b", "c"], ["b", "d"]]
      })
      
      col_transformer = ColumnTransformer([
          ('one-hot', OneHotEncoder(), ['id','text']),
          ('multi-hot', MultiHotEncoder(), ['label', 'label2'])
      ])
      col_transformer.fit_transform(temp)
      

      You should get:

      array([[1., 0., 0., 0., 1., 0., 0., 0., 1., 0., 1., 0., 1., 0., 1.],
             [0., 1., 0., 1., 0., 0., 1., 0., 1., 0., 0., 1., 1., 0., 0.],
             [0., 0., 1., 0., 0., 1., 0., 1., 0., 1., 0., 1., 0., 1., 0.]])
      

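      Note how the first 3 columns and the next 3 columns are one-hot encoded, while the following 5 columns and the last 4 columns are multi-hot encoded. The category information can be found as usual: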

      col_transformer.named_transformers_['one-hot'].categories_
      
      >>> [array([1, 2, 3], dtype=object),
           array(['some other text', 'some text', 'yet another text'], dtype=object)]
      
      col_transformer.named_transformers_['multi-hot'].categories_
      
      >>> [array(['black', 'brown', 'cat', 'dog', 'white'], dtype=object),
           array(['b', 'c', 'd', 'w'], dtype=object)]
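To label the multi-hot columns of the output, one option is to pair each input column name with its categories. A minimal sketch, assuming the category arrays shown above; `make_feature_names` is a hypothetical helper and not part of scikit-learn.

```python
import numpy as np


def make_feature_names(columns, categories):
    """Build '<column>_<category>' names, in column order then category order."""
    return [
        f"{column}_{category}"
        for column, column_categories in zip(columns, categories)
        for category in column_categories
    ]


# Category arrays as reported by the fitted multi-hot encoder above.
multi_hot_categories = [
    np.array(["black", "brown", "cat", "dog", "white"], dtype=object),
    np.array(["b", "c", "d", "w"], dtype=object),
]

names = make_feature_names(["label", "label2"], multi_hot_categories)
print(names)
```

The names come out in the same order as the encoded columns, since the encoder concatenates its per-column results in input order.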
      

      【Comments】:
