Scikit-Learn custom Imputer with random values around the mean

【Question title】: Scikit-Learn Custom Imputer with random value around mean value
【Posted】: 2021-02-02 04:31:12
【Question】:

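I want to create a custom imputer that replaces the NaN values in my data with a random value in the range from mean - std to mean + std, where the mean and std are those of the column the NaN is in.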

This is the code I have for the Imputer so far:

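import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted

class GroupImputer(BaseEstimator, TransformerMixin):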
    def fit(self, X, y=None):
        X = check_array(X, force_all_finite=False)
        self.means = np.nanmean(X, axis=0)
        self.stds = np.nanstd(X, axis=0)
        return self

    def transform(self, X, y=None):
        check_is_fitted(self, 'means')
        check_is_fitted(self, 'stds')
        X = check_array(X, force_all_finite=False)
        # How do I apply this to each row of the data?
        return 0

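self.means contains the list of means, one per column.

self.stds contains the list of standard deviations, one per column.

How do I apply a random value between mean - std and mean + std to each NaN in a row of the data?

Do I have to loop over the data (for row in X:) and pick the right mean and std based on the column index? Or is there another way to do this?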

【Comments】:

Tags: python scikit-learn

【Solution 1】:

No, you don't have to loop over the data. Suppose the data has 5 rows and 4 columns:

import numpy as np

num_rows, num_cols = 5, 4

# just fake two arrays of column means and stds
column_means = np.random.uniform(1, 8, num_cols)
column_stds = np.random.rand(num_cols)

# one uniform draw per cell; the per-column bounds broadcast across the rows
disp = np.random.uniform(column_means - column_stds, column_means + column_stds, size=(num_rows, num_cols))

The array disp looks something like this:

array([[6.29377845, 6.56185572, 5.32590954, 2.14719305],
       [6.36648777, 6.97781432, 4.89773801, 2.21909144],
       [5.38109603, 6.70649396, 5.50100582, 2.26518757],
       [5.59764259, 6.90297057, 5.65199988, 2.25340505],
       [5.80928963, 6.4976407 , 5.23792109, 1.99580784]])

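Each column of this array is sampled uniformly from the range (column mean - column std, column mean + column std). The NaN entries of the original array can therefore be replaced with the corresponding entries of disp.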
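
That last step, replacing the NaN entries with the entries of disp, might look roughly like the sketch below. The array X is made-up example data (an assumption, not from the question), and the column means and stds are computed from it with np.nanmean and np.nanstd instead of being faked.

```python
import numpy as np

# Made-up example data with a few missing values (an assumption for illustration)
X = np.array([[1.0, 10.0, np.nan],
              [2.0, np.nan, 30.0],
              [np.nan, 12.0, 32.0],
              [4.0, 14.0, 34.0]])

# Per-column statistics that ignore NaN entries
column_means = np.nanmean(X, axis=0)
column_stds = np.nanstd(X, axis=0)

# One uniform draw per cell; the 1-D bounds broadcast across the rows
disp = np.random.uniform(column_means - column_stds,
                         column_means + column_stds,
                         size=X.shape)

# Copy the random draws into the NaN positions only
nan_mask = np.isnan(X)
X_filled = X.copy()
X_filled[nan_mask] = disp[nan_mask]
```

Only the masked cells are overwritten, so every observed value in X is left unchanged. Drawing a full-size disp array wastes a few random numbers, but it keeps the code free of explicit loops.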

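【Comments】:

【Solution 2】:

No, there is a better option than looping over the data. You can create a uniformly random array of the same shape (drawn between the desired bounds) and replace each NaN value at index i with the random value at the same index.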

higher_bound = self.means + self.stds
lower_bound = self.means - self.stds
random_values = np.random.uniform(low=lower_bound, high=higher_bound, size=X.shape)  # uniformly random array with the same shape as X
nan_mask = np.isnan(X)  # True where X is NaN
X = np.where(nan_mask, random_values, X)  # takes from random_values where nan_mask is True, else from the original array
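
Putting this together with the fit method from the question, a complete transformer could be sketched as follows. This is one possible arrangement rather than the asker's final code: the class name RandomAroundMeanImputer, the random_state parameter, and the trailing-underscore attribute names means_ and stds_ are assumptions introduced for illustration, and input validation is reduced to np.asarray.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_is_fitted


class RandomAroundMeanImputer(TransformerMixin, BaseEstimator):
    """Hypothetical imputer: fills NaN with a uniform draw from [mean - std, mean + std)."""

    def __init__(self, random_state=None):
        # random_state (int or None) is an added assumption, used for reproducibility
        self.random_state = random_state

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Per-column statistics computed while ignoring NaN entries
        self.means_ = np.nanmean(X, axis=0)
        self.stds_ = np.nanstd(X, axis=0)
        return self

    def transform(self, X):
        check_is_fitted(self, ["means_", "stds_"])
        X = np.asarray(X, dtype=float)
        rng = np.random.default_rng(self.random_state)
        # One candidate value per cell; the per-column bounds broadcast across rows
        random_values = rng.uniform(self.means_ - self.stds_,
                                    self.means_ + self.stds_,
                                    size=X.shape)
        # Keep observed values, use the random draw only where X is NaN
        return np.where(np.isnan(X), random_values, X)


# Made-up example data (an assumption for illustration)
X = np.array([[1.0, np.nan], [3.0, 4.0], [np.nan, 8.0]])
X_imputed = RandomAroundMeanImputer(random_state=0).fit_transform(X)
```

With an integer random_state, every call to transform reseeds the generator and therefore produces the same draws; leaving it as None gives different fill values on each call.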

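【Comments】:

Please consider editing this answer to add an explanation of the code snippet. That would help the OP much more!
Please don't post only code as an answer; also explain what your code does and how it solves the problem in the question. Answers with an explanation are usually more helpful and of better quality, and are more likely to attract upvotes.
Thank you both.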