Remove outliers from df based on one column: answers

【Question title】: Remove outliers from df based on one column
【Posted】: 2023-02-18 04:45:01
【Question】:

My df has a price column that looks like this:

0         2125.000000
1        14469.483703
2        14101.832820
3        20287.619019
4        14469.483703
             ...     
12561     2490.000000
12562     2931.283333
12563     1779.661017
12566     2200.000000
12567     2966.666667

I want to drop every row of df whose price_m2 value is an outlier. I tried two approaches:

First:

df_w_o = df[np.abs(df.price_m2-df.price_m2.mean())<=(1*df.price_m2.std())]

Second:

df['z_score'] = (df['price_m2'] - df['price_m2'].mean()) / df['price_m2'].std()

df_w_o = df[(df['z_score'] < 1) & (df['z_score'] > -1)]

When I check the min and max afterwards, I get:

print(df_w_o.price_m2.min())
print(df_w_o.price_m2.max())
0.0
25438.022812290565

Before the removal I get:

print(df.price_m2.min())
print(df.price_m2.max())
0.0
589933.4267822268

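This doesn't feel right: how can I end up with such a wide price range for data that is supposed to describe real estate? In this example, 0 is an extremely low value, and it is still there after the outlier removal.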

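【Comments】:

Keep in mind that, for a normal distribution, outliers sit in the two tails: > mean+2*std and < mean-2*std.
Do you mean that df_w_o = df[(df['z_score'] < 1) & (df['z_score'] > -1)] should be df_w_o = df[(df['z_score'] < std) & (df['z_score'] > -std)]? My reasoning for using 1 std: since this is a price dataset for a narrow geographic area, I assumed 1x std would be more accurate.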
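To make the 1-std versus 2-std discussion concrete, here is a minimal sketch on simulated data. It assumes normally distributed prices; the mean, spread, and sample size are made up for illustration and are not the OP's data. It shows that a 1-std cut keeps only about 68% of perfectly clean normal data, while a 2-std cut keeps about 95%.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical, normally distributed prices (illustration only).
prices = pd.Series(rng.normal(loc=2500, scale=400, size=100_000), name="price_m2")

# Standard z-score, the same formula as in the question.
z = (prices - prices.mean()) / prices.std()

def kept_fraction(z_scores, k):
    """Share of rows whose absolute z-score is at most k."""
    return float((z_scores.abs() <= k).mean())

kept_1 = kept_fraction(z, 1)
kept_2 = kept_fraction(z, 2)
print(f"kept with |z| <= 1: {kept_1:.3f}")
print(f"kept with |z| <= 2: {kept_2:.3f}")
```

So a threshold of 1 discards roughly a third of the rows even when nothing is wrong with them, which may be more aggressive than intended.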

Tags: python pandas

【Solution 1】:

I suggest using the neulab library (see: https://pypi.org/project/neulab).

It should work with your DataFrame. For example, you can use Chauvenet's algorithm:

import pandas as pd
from neulab.OutlierDetection import Chauvenet

d = {'col1': [8.02, 8.16, 3.97, 8.64, 0.84, 4.46, 0.81, 7.74, 8.78, 9.26, 20.46, 29.87, 10.38, 25.71], 'col2': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
df = pd.DataFrame(data=d)

chvn = Chauvenet(dataframe=df, info=True, autorm=True)

Output: Detected outliers: {'col1': [29.87, 25.71, 20.46, 0.84, 0.81, 3.97, 4.46, 10.38, 7.74, 9.26]}

    col1    col2
0   8.02    1
1   8.16    1
3   8.64    1
8   8.78    1

Or use a distance-metric algorithm to find outliers:

from neulab.OutlierDetection import DistQuant

d = {'col1': [-6, 0, 1, 2, 4, 5, 5, 6, 7, 100], 'col2': [-1, 0, 1, 2, 0, 0, 1, 0, 50, 13]}
df = pd.DataFrame(data=d)

mdist = DistQuant(dataframe=df, metric='manhattan', filter='quantile', info=True, autorm=True)

Output: Distances: {0: 260.0, 1: 204.0, 2: 198.0, 3: 198.0, 4: 190.0, 5: 190.0, 6: 190.0, 7: 194.0, 8: 566.0, 9: 1014.0}

index col1  col2
1      0    0
2      1    1
3      2    2
4      4    0
5      5    0
6      5    1
7      6    0

【Comments】:

Good to know: this library is not available in Miniforge3 running on a Mac with the M1 chip.
Thank you for your reply. It will be fixed in the next release.

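【Solution 2】:

Suppose the OP's original data is normally distributed and free of outliers. The high value in the original dataset (about 589933) is then very likely an outlier. Let's draw a quantile-quantile plot of a randomly generated dataset: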

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

n = 100

np.random.seed(0)
df = pd.DataFrame({"price": np.random.normal(25000, 3000, n)})
qqplt = sm.qqplot(df["price"], line = 's',fit = True)
plt.show()

However, a single outlier can distort this completely.

outlier = 600000
df.loc[n] = outlier
qqplt = sm.qqplot(df["price"], line = 's',fit = True)
plt.show()

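Whenever we talk about outlier removal and it "doesn't feel right", we really need to step back and look at the data. As @kndahl suggested, using a package that bundles heuristics and data-removal methods is a good idea. Otherwise, gut feelings should be backed up by your own statistical analysis.

Finally, as to why 0 is still in the final dataset, let's take another look. We will add a 0 to the dataset and run the outlier removal. First we run your outlier removal as is; then we remove the extremely high value of 600,000 first and run your outlier method again.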

## simulated data with 0 also added
df.loc[n+1] = 0
df_w_o = df[np.abs(df.price-df.price.mean())<=(1*df.price.std())]
print(f"With the high outlier of 600,000 still in the original dataset, the new range is \nMin:{df_w_o.price.min()}\nMax:{df_w_o.price.max()}")

## With the high outlier of 600,000 still in the original dataset, the new range is 
## Min:0.0
## Max:31809.263871962823

## now let's remove the high outlier first before doing our outlier removal
df = df.drop(n)

df_w_o = df[np.abs(df.price-df.price.mean())<=(1*df.price.std())]
print(f"\n\nWith the outlier of 600,000 removed prior to analyzing the data, the new range is \nMin:{df_w_o.price.min()}\nMax:{df_w_o.price.max()}")

## With the outlier of 600,000 removed prior to analyzing the data, the new range is
## Min:21241.61391985022
## Max:28690.87204218316

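In this simulated case, the high outlier skews the statistics so badly that 0 falls within one standard deviation of the mean. Once we clean the data before processing, that 0 is removed. Relatedly, this question may be a better fit for Cross Validated, with a more complete dataset provided.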
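The effect can also be seen directly in the numbers. The sketch below reuses the same simulated setup (100 normal draws around 25,000, plus the values 600,000 and 0) and prints the one-standard-deviation band with and without the high outlier. The helper name one_std_band is made up for this illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
prices = pd.Series(np.random.normal(25000, 3000, 100))

# Same simulated data, once with both extreme values and once with only the 0.
with_both = pd.concat([prices, pd.Series([600000.0, 0.0])], ignore_index=True)
without_high = pd.concat([prices, pd.Series([0.0])], ignore_index=True)

def one_std_band(s):
    """Return (mean - std, mean + std) for a Series."""
    return s.mean() - s.std(), s.mean() + s.std()

low_a, high_a = one_std_band(with_both)
low_b, high_b = one_std_band(without_high)
print(f"with 600,000 present: keep values in [{low_a:.0f}, {high_a:.0f}]")
print(f"with 600,000 removed: keep values in [{low_b:.0f}, {high_b:.0f}]")
```

With the 600,000 present, the standard deviation is so inflated that the lower bound drops below zero, so a price of 0 passes the filter. Once it is removed, the lower bound is well above zero and the 0 is filtered out.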

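【Comments】:

That makes sense. But I can't remove it by hand, because my database is very large and this is just one geographic sample (a circle of 1 km around the query center). I need a solution that works for the whole country. If I drop the top 1-2% and the bottom 1-2% of values in the sample before running df[np.abs(df.price-df.price.mean())<=(1*df.price.std())], is that still acceptable from a data-analysis point of view? Or is it just bad practice?
Update: I did remove the top percentile before applying the z-score outlier cleaning, and, oh my, the results look much more like what I originally expected!
I'm not sure I would do it that way; I'd want to look at the distribution first. That said, some cleaning is fine: dropping all prices of 0 is reasonable. You could also inspect the top 10 values, since it doesn't take many bad values to skew things. Overall, you are trying to clear out spurious values. In any case, scrubbing the top percentile isn't the worst thing ever. If this answer helped, please consider accepting it.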
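The two-stage cleaning discussed in these comments could look roughly like the sketch below. It drops non-positive prices, trims the extreme percentiles, and only then applies the z-score filter. The function name trim_then_zscore, the 1%/99% cut-offs, and the demo data are assumptions for illustration, not something the OP posted.

```python
import numpy as np
import pandas as pd

def trim_then_zscore(df, col, lower_q=0.01, upper_q=0.99, k=1.0):
    """Drop non-positive values, trim extreme percentiles, then keep |z| <= k."""
    out = df[df[col] > 0]                          # a price of 0 is treated as spurious
    lo, hi = out[col].quantile([lower_q, upper_q])
    out = out[out[col].between(lo, hi)]            # percentile trimming
    z = (out[col] - out[col].mean()) / out[col].std()
    return out[z.abs() <= k]                       # z-score filter on the trimmed data

# Hypothetical sample: normal prices plus one 0 and one extreme value.
np.random.seed(0)
demo = pd.DataFrame({"price_m2": np.random.normal(2500, 300, 1000)})
demo.loc[len(demo)] = 0.0
demo.loc[len(demo)] = 589933.0

cleaned = trim_then_zscore(demo, "price_m2")
print(cleaned["price_m2"].min(), cleaned["price_m2"].max())
```

Because the mean and standard deviation are computed after trimming, a handful of extreme values can no longer inflate them. Whether 1% is the right cut-off depends on the distribution, so it is worth plotting the data before settling on one.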

【Solution 3】:

@SlimPun, this is what I meant:

import pandas as pd
import numpy as np

df=pd.DataFrame(np.random.normal(loc=10,scale=5,size=1000))  ## 1000 items in price column
df.columns=["Price"]

Replace the outliers with NaN:

df[(df.Price>(np.mean(df.Price)+2*np.std(df.Price))) | (df.Price<(np.mean(df.Price)-2*np.std(df.Price)))]=np.nan

Drop the outliers:

df=df.dropna(how='all')
df.shape ## (951,1) - without outliers ** this can change according to your distribution given by numpy

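【Comments】:

【Solution 4】:

This cleans outliers by filtering each numeric column: data points that lie outside the upper and lower caps are treated as outliers and removed.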

import numpy as np

column_list = ['col1', 'col2']

def outlier_clean(df, column_list):
    for i in column_list:
        q1 = np.quantile(df[i], 0.25)
        q3 = np.quantile(df[i], 0.75)
        median = np.median(df[i])
        IQR = q3 - q1
        # Caps are centred on the median, 1.5 * IQR on either side.
        upper_cap = median + (1.5 * IQR)
        lower_cap = median - (1.5 * IQR)
        mask1 = df[i] < upper_cap
        mask2 = df[i] > lower_cap
        # Keep only rows inside both caps; combining with | would keep every row.
        df = df[mask1 & mask2]
    return df

df = outlier_clean(df, column_list)
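The function above centers its caps on the median. A common variant, Tukey's fences, places them at Q1 - 1.5*IQR and Q3 + 1.5*IQR instead. The sketch below shows that variant, not the answerer's method. The function name iqr_filter is made up, and the demo values are a handful of rounded prices taken from the question plus its reported minimum and maximum.

```python
import pandas as pd

def iqr_filter(df, columns, k=1.5):
    """Keep rows inside [Q1 - k*IQR, Q3 + k*IQR] for every listed column."""
    keep = pd.Series(True, index=df.index)
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        # A row must be inside the fences for every column to survive.
        keep &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[keep]

demo = pd.DataFrame(
    {"price_m2": [0.0, 2125.0, 2200.0, 2490.0, 2931.28, 2966.67, 589933.43]}
)
filtered = iqr_filter(demo, ["price_m2"])
print(filtered)
```

On this small sample both the 0 and the 589933.43 fall outside the fences. Quartiles barely move when a few extreme values are present, so this filter does not suffer from the inflated standard deviation that affects the mean/std approach.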

【Comments】: