【Question Title】: How do I incorporate SelectKBest in an SKlearn pipeline
【Posted】: 2020-10-02 04:35:15
【Question Description】:

I am trying to build a text classifier using sklearn. The idea is to:

  1. Vectorize the training corpus with TfidfVectorizer
  2. Select the top 20,000 resulting features with SelectKBest (or all features, if fewer than 20k are produced)
  3. Feed those features into a LogisticRegression classifier

I set this up successfully as follows:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

vectorizer = TfidfVectorizer()
x_train = vectorizer.fit_transform(df_train["input"])
selector = SelectKBest(f_classif, k=min(20000, x_train.shape[1]))
selector.fit(x_train, df_train["label"].values)
x_train = selector.transform(x_train)
classifier = LogisticRegression()
classifier.fit(x_train, df_train["label"])

I would now like to package all of this into a pipeline and share that pipeline so that others can use it on their own text data. However, I can't figure out how to make SelectKBest reproduce the behaviour above, i.e. accept min(20000, n_features from the vectorizer output) as k. If I simply leave it as k=20000, as below, the pipeline raises an error when fitting a new corpus that yields fewer than 20k vectorized features:

pipe = Pipeline([
            ("vect",TfidfVectorizer()),
            ("selector",SelectKBest(f_classif, k=20000)),
            ("clf",LogisticRegression())])
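One workaround that avoids subclassing is to vectorize once to learn the vocabulary size, then cap k on the pipeline via set_params before fitting. A minimal sketch (the toy corpus and labels below stand in for df_train["input"] and df_train["label"]; the extra vectorization pass is the cost of this approach, and whoever fits the pipeline still has to run it, which is why a self-adjusting selector is better for sharing):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy corpus standing in for df_train["input"] / df_train["label"]
corpus = ["good movie", "bad movie", "great film", "terrible film",
          "good plot", "awful plot"]
labels = [1, 0, 1, 0, 1, 0]

pipe = Pipeline([
    ("vect", TfidfVectorizer()),
    ("selector", SelectKBest(f_classif, k=20000)),
    ("clf", LogisticRegression())])

# Extra pass: learn the vocabulary size, then cap k before fitting
n_features = TfidfVectorizer().fit_transform(corpus).shape[1]
pipe.set_params(selector__k=min(20000, n_features))
pipe.fit(corpus, labels)
```

This keeps SelectKBest untouched, but the capping logic lives outside the pipeline object, so it is not carried along when the pipeline is shared.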

【Question Discussion】:

  • The error is due to this function check here. You need to subclass SelectKBest and implement your own check to see whether k is smaller than the number of columns of X. If it is not, assign a new k.

Tags: python scikit-learn


【Solution 1】:

As @vivek kumar pointed out, you need to override SelectKBest's _check_params method and add your logic to it, like this:

import warnings

from sklearn.feature_selection import SelectKBest


class MySelectKBest(SelectKBest):
    def _check_params(self, X, y):
        # Clamp k to the number of available features instead of raising
        if self.k != "all" and self.k > X.shape[1]:
            warnings.warn("Less than %d number of features found, so setting k as %d"
                          % (self.k, X.shape[1]), UserWarning)
            self.k = X.shape[1]
        if not (self.k == "all" or 0 <= self.k):
            raise ValueError("k should be >=0, <= n_features = %d; got %r. "
                             "Use k='all' to return all features."
                             % (X.shape[1], self.k))

I have also added a warning for the case where fewer features are found than the threshold that was set. Now let's look at a working example:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
import warnings

categories = ['alt.atheism', 'comp.graphics',
              'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
              'comp.windows.x', 'misc.forsale', 'rec.autos']
newsgroups = fetch_20newsgroups(categories=categories)
y_true = newsgroups.target

# This newsgroups subset yields roughly 47K features after TF-IDF vectorization

# Case 1: K < number of features - the regular case
pipe = Pipeline([
            ("vect",TfidfVectorizer()),
            ("selector",MySelectKBest(f_classif, k=30000)),
            ("clf",LogisticRegression())])

pipe.fit(newsgroups.data, y_true)
pipe.score(newsgroups.data, y_true)
#0.968

# Case 2: K > number of features - the problematic case

pipe = Pipeline([
            ("vect",TfidfVectorizer()),
            ("selector",MySelectKBest(f_classif, k=50000)),
            ("clf",LogisticRegression())])

pipe.fit(newsgroups.data, y_true)
# UserWarning: Less than 50000 number of features found, so setting k as 47407

pipe.score(newsgroups.data, y_true)
#0.9792
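Since the goal is to share the fitted pipeline, it can be persisted with joblib. A minimal sketch (the selector step and the file name here are illustrative; the custom MySelectKBest class is omitted for brevity, but note that any custom class must also be importable in the environment that loads the file, because pickling stores only a reference to the class):

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Stand-in corpus; in the question this would be df_train["input"] / ["label"]
corpus = ["good movie", "bad movie", "great film", "awful film"]
labels = [1, 0, 1, 0]

pipe = Pipeline([("vect", TfidfVectorizer()),
                 ("clf", LogisticRegression())])
pipe.fit(corpus, labels)

# Share the fitted pipeline as a single file
joblib.dump(pipe, "text_clf_pipeline.joblib")

# Consumer side: load and predict on new text
loaded = joblib.load("text_clf_pipeline.joblib")
preds = loaded.predict(["a great movie"])
```

The loaded pipeline applies the exact vectorizer vocabulary and selector fitted on the original corpus, which is what makes it reusable on other people's text.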

Hope this helps!

【Discussion】:
