Removing features with low variance using scikit-learn

【Question title】: Removing features with low variance using scikit-learn
【Posted】: 2015-05-31 16:19:30
【Question】:

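scikit-learn offers several ways to remove descriptors (features). The tutorial linked below covers the basic methods for this purpose:

http://scikit-learn.org/stable/modules/feature_selection.html

However, the tutorial does not show any way to obtain the list of features that were removed or kept.

The code below is taken from the tutorial.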

    from sklearn.feature_selection import VarianceThreshold
    X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
    sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
    sel.fit_transform(X)

    # output:
    # array([[0, 1],
    #        [1, 0],
    #        [0, 0],
    #        [1, 1],
    #        [1, 0],
    #        [1, 1]])

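The example above ends up with only two descriptors, "shape (6, 2)", but in my case I have a huge data frame with shape (51 rows, 9000 columns). Once I have found a suitable model, I want to keep track of which features are useful and which are not, because I can then save computation time on the test data set by calculating only the useful features.

For example, when you do machine-learning modelling with WEKA 6.0, it offers great flexibility in feature selection: after removing the useless features, you can get the list of discarded features as well as the useful ones.

Thanks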

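【Comments】:

Sklearn works differently from WEKA. In this case sklearn does not give you a list of the best features; it directly returns a new array containing only those features. Do you really need the list? I guess it could be computed with a workaround, but is it really necessary?
@iluengo As far as I understand (I am not very experienced in ML, but an enthusiastic learner), the training and test sets should have the same number of features at the same indices; otherwise, in WEKA's case, it raises an error. If the test set is derived internally by splitting the data, I will always have the same features at the same indices. But if we use an external or unknown data set as the test set, how can we make predictions on the unknown data without knowing the names of the features?
Yes, you are right. I was only thinking about training, haha.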

Tags: python-2.7 scikit-learn scikits

【Solution 1】:

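If I am not mistaken, here is what you can do:

With VarianceThreshold, you can call the method fit instead of fit_transform. This fits the data, and the resulting variances are stored in vt.variances_ (assuming vt is your object).

Given a threshold, you can then extract the same features that fit_transform would return (X must be a NumPy array for this indexing to work):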

X[:, vt.variances_ > threshold]

Or get the indices with:

idx = np.where(vt.variances_ > threshold)[0]

Or as a mask:

mask = vt.variances_ > threshold

PS: the default threshold is 0.
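
The three variants above can be sketched as follows. This is only an illustration that reuses the tutorial data from the question; it assumes X has been converted to a NumPy array (a plain list of lists does not support boolean column indexing), and the name X_kept is made up for the example.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Tutorial data from the question, converted to an array so that
# boolean column indexing works.
X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0],
              [0, 1, 1], [0, 1, 0], [0, 1, 1]])
threshold = 0.8 * (1 - 0.8)

vt = VarianceThreshold(threshold=threshold)
vt.fit(X)  # fit only; the per-column variances end up in vt.variances_

mask = vt.variances_ > threshold  # boolean mask of the kept columns
idx = np.where(mask)[0]           # integer indices of the kept columns
X_kept = X[:, mask]               # same columns that vt.transform(X) returns

print(vt.variances_)
print(idx)
```

Here only the first column falls below the threshold, so idx contains the positions 1 and 2.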

EDIT:

A more direct way is to use the get_support method of the VarianceThreshold class. From the documentation:

get_support([indices])  Get a mask, or integer index, of the features selected

You should call this method after fit or fit_transform.
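
As a rough sketch of how get_support can give both lists the question asks for (kept and discarded features), assuming the data lives in a pandas DataFrame. The frame and its column names "a", "b", "c" are hypothetical; the values are the tutorial data from the question.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical frame; the column names are made up for illustration.
df = pd.DataFrame({
    "a": [0, 0, 1, 0, 0, 0],
    "b": [0, 1, 0, 1, 1, 1],
    "c": [1, 0, 0, 1, 0, 1],
})

sel = VarianceThreshold(threshold=0.8 * (1 - 0.8))
sel.fit(df)

support = sel.get_support()              # boolean mask, one entry per column
kept = df.columns[support].tolist()      # names of the retained features
dropped = df.columns[~support].tolist()  # names of the discarded features

print(kept)
print(dropped)
```

Passing indices=True to get_support returns integer positions instead of a mask, which may be more convenient when the data is a plain array without column names.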

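【Comments】:

To get the filtered data frame after fitting: df.loc[:, sel.get_support()], where df is a pandas DataFrame and sel is the VarianceThreshold.
@arun: I think your solution is actually the best. Thanks.

【Solution 2】: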

import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Just make a convenience function; this one wraps the VarianceThreshold
# transformer but you can pass it a pandas dataframe and get one in return

def get_low_variance_columns(dframe=None, columns=None,
                             skip_columns=None, thresh=0.0,
                             autoremove=False):
    """
    Wrapper for sklearn VarianceThreshold for use on pandas dataframes.
    """
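    print("Finding low-variance features.")
    # treat a missing `skip_columns` as "skip nothing"; passing None on to
    # the index operations below would raise an exception
    skip_columns = [] if skip_columns is None else list(skip_columns)
    # defined up front so the final `return` still works if the try block fails
    removed_features = []
    try: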
        # get list of all the original df columns
        all_columns = dframe.columns

        # remove `skip_columns`
        remaining_columns = all_columns.drop(skip_columns)

        # get length of new index
        max_index = len(remaining_columns) - 1

        # get indices for `skip_columns`
        skipped_idx = [all_columns.get_loc(column)
                       for column
                       in skip_columns]

        # adjust insert location by the number of columns removed
        # (for non-zero insertion locations) to keep relative
        # locations intact
        for idx, item in enumerate(skipped_idx):
            if item > max_index:
                diff = item - max_index
                skipped_idx[idx] -= diff
            if item == max_index:
                diff = item - len(skip_columns)
                skipped_idx[idx] -= diff
            if idx == 0:
                skipped_idx[idx] = item

        # get values of `skip_columns`
        skipped_values = dframe.iloc[:, skipped_idx].values

        # get dataframe values
        X = dframe.loc[:, remaining_columns].values

        # instantiate VarianceThreshold object
        vt = VarianceThreshold(threshold=thresh)

        # fit vt to data
        vt.fit(X)

        # get the indices of the features that are being kept
        feature_indices = vt.get_support(indices=True)

        # remove low-variance columns from index
        feature_names = [remaining_columns[idx]
                         for idx, _
                         in enumerate(remaining_columns)
                         if idx
                         in feature_indices]

        # get the columns to be removed
        removed_features = list(np.setdiff1d(remaining_columns,
                                             feature_names))
        print("Found {0} low-variance columns."
              .format(len(removed_features)))

        # remove the columns
        if autoremove:
            print("Removing low-variance features.")
            # remove the low-variance columns
            X_removed = vt.transform(X)

            print("Reassembling the dataframe (with low-variance "
                  "features removed).")
            # re-assemble the dataframe
            dframe = pd.DataFrame(data=X_removed,
                                  columns=feature_names)

            # add back the `skip_columns`
            for idx, index in enumerate(skipped_idx):
                dframe.insert(loc=index,
                              column=skip_columns[idx],
                              value=skipped_values[:, idx])
            print("Successfully removed low-variance columns.")

        # do not remove columns
        else:
            print("No changes have been made to the dataframe.")

    except Exception as e:
        print(e)
        print("Could not remove low-variance features. Something "
              "went wrong.")
        pass

    return dframe, removed_features

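【Comments】:

Very useful method. I also found it useful to set the default value of skip_columns to an empty list [] instead of None, because None raises an exception if I do not intend to skip any columns.
@Sarah Correct, but you could just use the standard sklearn.feature_selection.VarianceThreshold with the underlying numpy array instead of a pandas.DataFrame. :)
@JasonWolosonovich When I try the method above, I get "UnboundLocalError: local variable 'removed_features' referenced before assignment" ... any fix?

【Solution 3】:

If you want to see exactly which columns remain after thresholding, this worked for me: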

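from sklearn.feature_selection import VarianceThreshold

# `data` is assumed to be a pandas DataFrame
threshold_n = 0.95
sel = VarianceThreshold(threshold=threshold_n * (1 - threshold_n))
sel_var = sel.fit_transform(data)  # NumPy array without column names

# index the original frame to keep the names of the surviving columns
data[data.columns[sel.get_support(indices=True)]]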
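
The question also raises the concern that an external test set must end up with the same features as the training set. One way this might be handled, sketched under the assumption that both sets are DataFrames with identical columns (the frames train and test and the feature names f1, f2, f3 are hypothetical), is to fit the selector on the training data only and reuse the resulting column list:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical training and test frames with identical columns.
train = pd.DataFrame({"f1": [0, 0, 0, 0],
                      "f2": [0, 1, 0, 1],
                      "f3": [1, 2, 3, 4]})
test = pd.DataFrame({"f1": [0, 1], "f2": [1, 1], "f3": [5, 6]})

sel = VarianceThreshold(threshold=0.0)
sel.fit(train)  # variances are computed on the training data only

kept = train.columns[sel.get_support()]

# Select the same columns in both sets; nothing is re-fitted on the test data.
train_sel = train[kept]
test_sel = test[kept]

print(list(kept))
```

Because kept holds column names, only those features would need to be computed for new data, which is the time saving the question is after.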

【Solution 4】:

While testing features, I wrote this simple function, which tells me which variables remain in the data frame after applying VarianceThreshold.

from sklearn.feature_selection import VarianceThreshold
from itertools import compress

def fs_variance(df, threshold:float=0.1):
    """
    Return a list of selected variables based on the threshold.
    """

    # The list of columns in the data frame
    features = list(df.columns)
    
    # Initialize and fit the method
    vt = VarianceThreshold(threshold = threshold)
    _ = vt.fit(df)
    
    # Get the column names that pass the threshold
    feat_select = list(compress(features, vt.get_support()))
    
    return feat_select

It returns a list of the selected column names, for example: ['col_2', 'col_14', 'col_17'].
