【Question title】: Unable to do Stacking for a Multi-label classifier
【Posted】: 2020-08-02 04:36:36
【Question description】:

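I am working on a multilabel text classification problem (90 target labels in total). The data has a long-tailed, class-imbalanced label distribution and about 100,000 records. I am using the OAA (one-vs-all) strategy, and I am trying to build an ensemble with Stacking.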

Text features: HashingVectorizer (2**20 features, character analyzer)
TSVD to reduce dimensionality (n_components=200).

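from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import ExtraTreesClassifier, StackingClassifier
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

text_pipeline = Pipeline([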
    ('hashing_vectorizer', HashingVectorizer(n_features=2**20,
                                             analyzer='char')),
    ('svd', TruncatedSVD(algorithm='randomized',
                         n_components=200, random_state=19204))])

feat_pipeline = FeatureUnion([('text', text_pipeline)])

estimators_list = [('ExtraTrees',
                    OneVsRestClassifier(ExtraTreesClassifier(n_estimators=30,
                                                             class_weight="balanced",
                                                             random_state=4621))),
                   ('linearSVC',
                    OneVsRestClassifier(LinearSVC(class_weight='balanced')))]
estimators_ensemble = StackingClassifier(estimators=estimators_list,
                                         final_estimator=OneVsRestClassifier(
                                             LogisticRegression(solver='lbfgs',
                                                                max_iter=300)))

classifier_pipeline = Pipeline([
    ('features', feat_pipeline),
    ('clf', estimators_ensemble)])

Error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-41-ad4e769a0a78> in <module>()
      1 start = time.time()
----> 2 classifier_pipeline.fit(X_train.values, y_train_encoded)
      3 print(f"Execution time {time.time()-start}")
      4 

3 frames
/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py in column_or_1d(y, warn)
    795         return np.ravel(y)
    796 
--> 797     raise ValueError("bad input shape {0}".format(shape))
    798 
    799 

ValueError: bad input shape (89792, 83)

【Comments】:

    Tags: machine-learning scikit-learn nlp multilabel-classification ensemble-learning


    【Solution 1】:

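    At the time of this answer, StackingClassifier does not support multilabel classification. You can tell what an estimator supports by looking at the shapes it documents for the parameters of fit (for example, here): StackingClassifier.fit expects a 1-D y, whereas a multilabel target is a 2-D indicator matrix.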
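
As a quick check of the point above, here is a minimal sketch using made-up arrays (they are not the asker's data). `sklearn.utils.multiclass.type_of_target` reports how scikit-learn interprets a target. A 2-D indicator matrix, presumably like the `(89792, 83)` array named in the traceback, is a multilabel target, which an estimator that calls `column_or_1d` on `y` rejects:

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Hypothetical targets, for illustration only
y_single = np.array([0, 1, 1, 0])        # shape (4,): one label per sample
y_multilabel = np.array([[1, 0, 1],
                         [0, 1, 0],
                         [1, 1, 0],
                         [0, 0, 1]])     # shape (4, 3): one column per label

kind_single = type_of_target(y_single)
kind_multilabel = type_of_target(y_multilabel)

print(y_single.shape, kind_single)
print(y_multilabel.shape, kind_multilabel)
```

Each column of the indicator matrix, taken on its own, is an ordinary binary target, which is why wrapping the whole stack in OneVsRestClassifier works.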

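    The fix is to put the OneVsRestClassifier wrapper around the StackingClassifier rather than around each individual model. OneVsRestClassifier then fits one binary stacking model per label, so every StackingClassifier only ever sees a 1-D target.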

    Example:

    from sklearn.datasets import make_multilabel_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.svm import LinearSVC
    from sklearn.ensemble import StackingClassifier
    from sklearn.multiclass import OneVsRestClassifier
    
    X, y = make_multilabel_classification(n_classes=3, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                        test_size=0.33,
                                                        random_state=42)
    
    estimators_list = [('ExtraTrees', ExtraTreesClassifier(n_estimators=30, 
                                                           class_weight="balanced", 
                                                           random_state=4621)),
                       ('linearSVC', LinearSVC(class_weight='balanced'))]
    
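    # Stack the binary base models; the meta-learner is fit on a 1-D target
    estimators_ensemble = StackingClassifier(
        estimators=estimators_list,
        final_estimator=LogisticRegression(solver='lbfgs', max_iter=300))
    
    # Wrap the whole stack so that one stacking model is fit per label
    ovr_model = OneVsRestClassifier(estimators_ensemble)
    
    ovr_model.fit(X_train, y_train)
    # For multilabel targets, score() returns subset accuracy
    ovr_model.score(X_test, y_test)
    
    # 0.45454545454545453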
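
Applied to the pipeline from the question, the same idea might look like the sketch below. This is an illustration under stated assumptions, not the asker's actual setup: the toy corpus, the two labels, and the reduced sizes (`n_features=2**10`, `n_components=3`) are invented so the example runs in a few seconds. Only the placement of OneVsRestClassifier differs from the original code.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import ExtraTreesClassifier, StackingClassifier
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy corpus (hypothetical): 4 templates x 10 variants, 2 labels per document
templates = [("cheap flights to paris", [1, 0]),
             ("python pandas dataframe", [0, 1]),
             ("book flights with python", [1, 1]),
             ("weather is nice today", [0, 0])]
texts = [f"{text} {i}" for i in range(10) for text, _ in templates]
y = np.array([labels for _ in range(10) for _, labels in templates])

# Same feature steps as in the question, with smaller sizes for the toy data
text_pipeline = Pipeline([
    ('hashing_vectorizer', HashingVectorizer(n_features=2**10,
                                             analyzer='char')),
    ('svd', TruncatedSVD(algorithm='randomized',
                         n_components=3, random_state=19204))])

# Base models are plain binary classifiers: no OneVsRestClassifier here
estimators_list = [
    ('ExtraTrees', ExtraTreesClassifier(n_estimators=30,
                                        class_weight='balanced',
                                        random_state=4621)),
    ('linearSVC', LinearSVC(class_weight='balanced'))]

stack = StackingClassifier(
    estimators=estimators_list,
    final_estimator=LogisticRegression(solver='lbfgs', max_iter=300))

# OneVsRestClassifier wraps the whole stack: one stacking model per label
classifier_pipeline = Pipeline([
    ('features', text_pipeline),
    ('clf', OneVsRestClassifier(stack))])

classifier_pipeline.fit(texts, y)
pred = classifier_pipeline.predict(texts)
print(pred.shape)
```

Note the cost of this layout: with the default `cv=5`, each label trains its own stack with internal cross-validation, so the number of model fits grows linearly with the number of labels.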
    
    
