是否有一个 numpy 内置函数来拒绝列表中的异常值答案

【问题标题】：Is there a numpy builtin to reject outliers from a list是否有一个 numpy 内置函数来拒绝列表中的异常值
【发布时间】：2012-07-26 01:38:00
【问题描述】：

是否有一个 numpy 内置函数可以执行以下操作？也就是说，获取一个列表d 并返回一个列表filtered_d，根据d 中点的一些假设分布，删除了所有外围元素。

import numpy as np

def reject_outliers(data):
    m = 2
    u = np.mean(data)
    s = np.std(data)
    filtered = [e for e in data if (u - 2 * s < e < u + 2 * s)]
    return filtered

>>> d = [2,4,5,1,6,5,40]
>>> filtered_d = reject_outliers(d)
>>> print filtered_d
[2,4,5,1,6,5]

我说“类似”是因为该函数可能允许不同的分布（泊松、高斯等）和这些分布中的不同异常值阈值（例如我在这里使用的 m）。

【问题讨论】：

相关：Can scipy.stats identify and mask obvious outliers?，尽管这个问题似乎处理更复杂的情况。对于您描述的简单任务，外部包似乎是矫枉过正。
我在想，鉴于主 numpy 库中的内置函数数量，奇怪的是没有什么可做的。处理原始、嘈杂的数据似乎很常见。
线性异常值可以通过numpy std函数找到，但是，如果数据是非线性的，例如抛物线或三次函数，standard deviation将无法很好地处理任务，因为它需要回归来帮助计算异常值。
这就是我编写这个 repo 的原因：outliers.py

标签： python numpy

【解决方案1】：

我的解决方案降低了顶部和底部的百分位数，保持等于边界的值：

def remove_percentile_outliers(data, percent_to_drop=0.001):
    low, high = data.quantile([percent_to_drop / 2, 1-percent_to_drop / 2])
    return data[(data >= low )&(data <= high)]

【讨论】：

【解决方案2】：

答案太多了，但我要添加一个对作者甚至其他用户有用的新答案。

您可以使用 Hampel 过滤器。但是你需要使用Series。

Hampel 过滤器返回异常值索引，然后您可以将它们从Series 中删除，然后将其转换回List。

要使用Hampel过滤器，您可以使用pip轻松安装包：

pip install hampel

用法：

# Imports
from hampel import hampel
import pandas as pd

list_d = [2, 4, 5, 1, 6, 5, 40]

# List to Series
time_series = pd.Series(list_d)

# Outlier detection with Hampel filter
# Returns the Outlier indices
outlier_indices = hampel(ts = time_series, window_size = 3)

# Drop Outliers indices from Series
filtered_d = time_series.drop(outlier_indices)

filtered_d.values.tolist()

print(f'filtered_d: {filtered_d.values.tolist()}')

输出将是：

filtered_d: [2, 4, 5, 1, 6, 5]

其中，ts 是 pandas Series 对象，window_size 是总窗口大小，将计算为 2 * window_size + 1。

对于这个系列，我将window_size 设置为3。

使用 Series 最酷的地方在于能够生成图形：

# Imports
import matplotlib.pyplot as plt

plt.style.use('seaborn-darkgrid')

# Plot Original Series
time_series.plot(style = 'k-')
plt.title('Original Series')
plt.show()
    
# Plot Cleaned Series
filtered_d.plot(style = 'k-')
plt.title('Cleaned Series (Without detected Outliers)')
plt.show()

输出将是：

要了解有关 Hampel 过滤器的更多信息，我推荐以下阅读材料：

【讨论】：

【解决方案3】：

沿轴修剪 numpy 数组中的异常值，并将其替换为沿该轴的最小值或最大值，以更接近者为准。阈值是 z-score：

def np_z_trim(x, threshold=10, axis=0):
    """ Replace outliers in numpy ndarray along axis with min or max values
    within the threshold along this axis, whichever is closer."""
    mean = np.mean(x, axis=axis, keepdims=True)
    std = np.std(x, axis=axis, keepdims=True)
    masked = np.where(np.abs(x - mean) < threshold * std, x, np.nan)
    min = np.nanmin(masked, axis=axis, keepdims=True)
    max = np.nanmax(masked, axis=axis, keepdims=True)
    repl = np.where(np.abs(x - max) < np.abs(x - min), max, min)
    return np.where(np.isnan(masked), repl, masked)

【讨论】：

【解决方案4】：

在这里，我在 x 中找到异常值，并将它们替换为它们周围的点窗口的中值 (win)（取自 Benjamin Bannier 答案中值偏差）

def outlier_smoother(x, m=3, win=3, plots=False):
    ''' finds outliers in x, points > m*mdev(x) [mdev:median deviation] 
    and replaces them with the median of win points around them '''
    x_corr = np.copy(x)
    d = np.abs(x - np.median(x))
    mdev = np.median(d)
    idxs_outliers = np.nonzero(d > m*mdev)[0]
    for i in idxs_outliers:
        if i-win < 0:
            x_corr[i] = np.median(np.append(x[0:i], x[i+1:i+win+1]))
        elif i+win+1 > len(x):
            x_corr[i] = np.median(np.append(x[i-win:i], x[i+1:len(x)]))
        else:
            x_corr[i] = np.median(np.append(x[i-win:i], x[i+1:i+win+1]))
    if plots:
        plt.figure('outlier_smoother', clear=True)
        plt.plot(x, label='orig.', lw=5)
        plt.plot(idxs_outliers, x[idxs_outliers], 'ro', label='outliers')                                                                                                                    
        plt.plot(x_corr, '-o', label='corrected')
        plt.legend()
    
    return x_corr

【讨论】：

【解决方案5】：

在处理异常值时有一点很重要，那就是应该尽量使用稳健的估算器。分布的平均值将受到异常值的影响，但例如中位数会少很多。

基于 eumiro 的回答：

def reject_outliers(data, m = 2.):
    d = np.abs(data - np.median(data))
    mdev = np.median(d)
    s = d/mdev if mdev else 0.
    return data[s<m]

在这里，我将平均值替换为更稳健的中位数，并将标准差替换为中位数到中位数的绝对距离。然后，我按（再次）中间值缩放距离，使 m 处于合理的相对比例。

请注意，要使 data[s<m] 语法起作用，data 必须是一个 numpy 数组。

【讨论】：

itl.nist.gov/div898/handbook/eda/section3/eda35h.htm 这基本上是此处引用的修改后的 Z 分数，但阈值不同。如果我的数学是正确的，他们推荐 3.5 / .6745 ~= 5.189 的 m（他们将 s 乘以 0.6745 并指定 m 的 3.5 ......也取 abs(s)）。谁能解释一下m的选择？还是您会从您的特定数据集中识别出来？
@BenjaminBannier：您能否提供一些具体的解释来为m 选择一个值，而不是像“纯度和效率的相互作用”这样的模糊陈述？
@stackoverflowuser2010：就像我说的，这取决于您的具体要求，即我们需要信号样本的清洁程度（误报），或者我们可以承受多少信号测量保持信号干净（假阴性）。至于特定用例的具体示例评估，请参见例如desy.de/~blist/notes/whyeffpur.ps.gz。
当我使用浮点列表调用函数时出现以下错误：TypeError: only integer scalar arrays can be converted to a scalar index
@Charlie，如果你看图itl.nist.gov/div898/handbook/eda/section3/eda356.htm#MAD，你会发现在处理正态分布时（实际上不是这样，你需要修改后的 z 分数），SD = 1，您有 MAD ~ 0.68，这解释了比例因子。因此，选择 m = 3.5 意味着您要丢弃 0.05 % 的数据。

【解决方案6】：

对于一组图像（每张图像有 3 个维度），我想拒绝我使用的每个像素的异常值：

mean = np.mean(imgs, axis=0)
std = np.std(imgs, axis=0)
mask = np.greater(0.5 * std + 1, np.abs(imgs - mean))
masked = np.multiply(imgs, mask)

那么就可以计算均值了：

masked_mean = np.divide(np.sum(masked, axis=0), np.sum(mask, axis=0))

（我用它来做背景减法）

【讨论】：

【解决方案7】：

如果要获取异常值的索引位置idx_list会返回。

def reject_outliers(data, m = 2.):
        d = np.abs(data - np.median(data))
        mdev = np.median(d)
        s = d/mdev if mdev else 0.
        data_range = np.arange(len(data))
        idx_list = data_range[s>=m]
        return data[s<m], idx_list

data_points = np.array([8, 10, 35, 17, 73, 77])  
print(reject_outliers(data_points))

after rejection: [ 8 10 35 17], index positions of outliers: [4 5]

【讨论】：

【解决方案8】：

考虑到当您的标准差因巨大的异常值而变得非常大时，上述所有方法都会失败。

（与平均计算失败类似，应该计算中位数。虽然，平均值“更容易出现像 stdDv 这样的错误”。）

您可以尝试迭代地应用您的算法，或者使用四分位数范围进行过滤：（这里的“因子”与 n*sigma 范围有关，但仅当您的数据遵循高斯分布时）

import numpy as np

def sortoutOutliers(dataIn,factor):
    quant3, quant1 = np.percentile(dataIn, [75 ,25])
    iqr = quant3 - quant1
    iqrSigma = iqr/1.34896
    medData = np.median(dataIn)
    dataOut = [ x for x in dataIn if ( (x > medData - factor* iqrSigma) and (x < medData + factor* iqrSigma) ) ] 
    return(dataOut)

【讨论】：

对不起，我忽略了上面已经有一个 IQR 建议。由于代码较短，我应该留下这个答案还是删除它？

【解决方案9】：

Benjamin Bannier 的答案在与中位数距离的中位数为 0 时产生了直通，因此我发现此修改后的版本对以下示例中给出的情况更有帮助。

def reject_outliers_2(data, m=2.):
    d = np.abs(data - np.median(data))
    mdev = np.median(d)
    s = d / (mdev if mdev else 1.)
    return data[s < m]

例子：

data_points = np.array([10, 10, 10, 17, 10, 10])
print(reject_outliers(data_points))
print(reject_outliers_2(data_points))

给予：

[[10, 10, 10, 17, 10, 10]]  # 17 is not filtered
[10, 10, 10, 10, 10]  # 17 is filtered (it's distance, 7, is greater than m)

【讨论】：

【解决方案10】：

我想在这个答案中提供两种方法，基于“z score”的解决方案和基于“IQR”的解决方案。

此答案中提供的代码适用于单个暗淡 numpy 数组和多个 numpy 数组。

我们先导入一些模块。

import collections
import numpy as np
import scipy.stats as stat
from scipy.stats import iqr

基于z分数的方法

此方法将测试数字是否超出三个标准差。根据这个规则，如果值是异常值，该方法将返回true，否则返回false。

def sd_outlier(x, axis = None, bar = 3, side = 'both'):
    assert side in ['gt', 'lt', 'both'], 'Side should be `gt`, `lt` or `both`.'

    d_z = stat.zscore(x, axis = axis)

    if side == 'gt':
        return d_z > bar
    elif side == 'lt':
        return d_z < -bar
    elif side == 'both':
        return np.abs(d_z) > bar

基于IQR的方法

该方法会测试该值是否小于q1 - 1.5 * iqr或大于q3 + 1.5 * iqr，类似于SPSS的绘图方法。

def q1(x, axis = None):
    return np.percentile(x, 25, axis = axis)

def q3(x, axis = None):
    return np.percentile(x, 75, axis = axis)

def iqr_outlier(x, axis = None, bar = 1.5, side = 'both'):
    assert side in ['gt', 'lt', 'both'], 'Side should be `gt`, `lt` or `both`.'

    d_iqr = iqr(x, axis = axis)
    d_q1 = q1(x, axis = axis)
    d_q3 = q3(x, axis = axis)
    iqr_distance = np.multiply(d_iqr, bar)

    stat_shape = list(x.shape)

    if isinstance(axis, collections.Iterable):
        for single_axis in axis:
            stat_shape[single_axis] = 1
    else:
        stat_shape[axis] = 1

    if side in ['gt', 'both']:
        upper_range = d_q3 + iqr_distance
        upper_outlier = np.greater(x - upper_range.reshape(stat_shape), 0)
    if side in ['lt', 'both']:
        lower_range = d_q1 - iqr_distance
        lower_outlier = np.less(x - lower_range.reshape(stat_shape), 0)

    if side == 'gt':
        return upper_outlier
    if side == 'lt':
        return lower_outlier
    if side == 'both':
        return np.logical_or(upper_outlier, lower_outlier)

最后，如果要过滤掉异常值，请使用numpy 选择器。

祝你有美好的一天。

【讨论】：

【解决方案11】：

我想做类似的事情，除了将数字设置为 NaN 而不是从数据中删除它，因为如果你删除它，你会改变长度，这可能会弄乱绘图（即，如果你只是从一列中删除异常值在表格中，但您需要它与其他列保持相同，以便您可以将它们相互绘制）。

为此，我使用了numpy's masking functions：

def reject_outliers(data, m=2):
    stdev = np.std(data)
    mean = np.mean(data)
    maskMin = mean - stdev * m
    maskMax = mean + stdev * m
    mask = np.ma.masked_outside(data, maskMin, maskMax)
    print('Masking values outside of {} and {}'.format(maskMin, maskMax))
    return mask

【讨论】：

您也可以将它们 np.clip 到最小和最大允许值以保持尺寸。

【解决方案12】：

另一种方法是对标准差进行稳健估计（假设高斯统计）。查了一下网上的计算器，我看到 90% 的百分位数对应 1.2815σ，95% 对应的是 1.645σ (http://vassarstats.net/tabs.html?#z)

举个简单的例子：

import numpy as np

# Create some random numbers
x = np.random.normal(5, 2, 1000)

# Calculate the statistics
print("Mean= ", np.mean(x))
print("Median= ", np.median(x))
print("Max/Min=", x.max(), " ", x.min())
print("StdDev=", np.std(x))
print("90th Percentile", np.percentile(x, 90))

# Add a few large points
x[10] += 1000
x[20] += 2000
x[30] += 1500

# Recalculate the statistics
print()
print("Mean= ", np.mean(x))
print("Median= ", np.median(x))
print("Max/Min=", x.max(), " ", x.min())
print("StdDev=", np.std(x))
print("90th Percentile", np.percentile(x, 90))

# Measure the percentile intervals and then estimate Standard Deviation of the distribution, both from median to the 90th percentile and from the 10th to 90th percentile
p90 = np.percentile(x, 90)
p10 = np.percentile(x, 10)
p50 = np.median(x)
# p50 to p90 is 1.2815 sigma
rSig = (p90-p50)/1.2815
print("Robust Sigma=", rSig)

rSig = (p90-p10)/(2*1.2815)
print("Robust Sigma=", rSig)

我得到的输出是：

Mean=  4.99760520022
Median=  4.95395274981
Max/Min= 11.1226494654   -2.15388472011
Sigma= 1.976629928
90th Percentile 7.52065379649

Mean=  9.64760520022
Median=  4.95667658782
Max/Min= 2205.43861943   -2.15388472011
Sigma= 88.6263902244
90th Percentile 7.60646688694

Robust Sigma= 2.06772555531
Robust Sigma= 1.99878292462

接近预期值 2。

如果我们要删除高于/低于 5 个标准差的点（1000 个点，我们期望 1 个值 > 3 个标准差）：

y = x[abs(x - p50) < rSig*5]

# Print the statistics again
print("Mean= ", np.mean(y))
print("Median= ", np.median(y))
print("Max/Min=", y.max(), " ", y.min())
print("StdDev=", np.std(y))

这给出了：

Mean=  4.99755359935
Median=  4.95213030447
Max/Min= 11.1226494654   -2.15388472011
StdDev= 1.97692712883

我不知道哪种方法更有效/更强大

【讨论】：

【解决方案13】：

在 Benjamin 的基础上，使用 pandas.Series，并替换 MAD with IQR：

def reject_outliers(sr, iq_range=0.5):
    pcnt = (1 - iq_range) / 2
    qlow, median, qhigh = sr.dropna().quantile([pcnt, 0.50, 1-pcnt])
    iqr = qhigh - qlow
    return sr[ (sr - median).abs() <= iqr]

例如，如果您设置iq_range=0.6，则四分位数范围的百分位数将变为：0.20 <--> 0.80，因此将包含更多个异常值。

【讨论】：

【解决方案14】：

此方法与您的方法几乎相同，只是更多 numpyst（也仅适用于 numpy 数组）：

def reject_outliers(data, m=2):
    return data[abs(data - np.mean(data)) < m * np.std(data)]

【讨论】：

如果m 足够大（例如m=6），该方法就足够了，但对于m 的小值，这会受到均值方差不是稳健估计的影响。
这并不是对这种方法的抱怨，而是对“异常值”这个模糊概念的抱怨
如何选择m？
我还没有让它工作。我不断收到错误 return data[abs(data - np.mean(data))
@johnktejik data arg 需要是一个 numpy 数组。