【Question title】: Filtering out outliers in Pandas dataframe with rolling median
【Posted】: 2018-04-08 10:17:39
【Question description】:

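I am trying to filter some outliers out of a scatter plot of GPS elevation displacements against dates.

I am trying to use df.rolling to compute the median and standard deviation of each window, and then remove a point if it lies more than 3 standard deviations from the median.

However, I can't figure out a way to loop through the column and compare each value with the rolling median.

Here is the code I have so far: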

import pandas as pd
import numpy as np

def median_filter(df, window):
    cnt = 0
    median = df['b'].rolling(window).median()
    std = df['b'].rolling(window).std()
    for row in df.b:
        # compare each value to its median (the step I am stuck on)
        pass

df = pd.DataFrame(np.random.randint(0,100,size=(100,2)), columns = ['a', 'b'])

median_filter(df, 10)

How can I loop through, compare each point against its rolling median, and remove the outliers?

【Discussion】:

    Tags: pandas median outliers rolling-computation


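    【Solution 1】:

    There may well be a more idiomatic pandas way to do this. This is a bit of a hack that relies on manually mapping the index of the original df to each rolling window (I picked a window size of 6). The records up to and including row 6 are associated with the first window, row 7 is the second window, and so on.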

    n = 100
    df = pd.DataFrame(np.random.randint(0,n,size=(n,2)), columns = ['a','b'])
    
    ## set window size
    window=6
    std = 1  # I set it at just 1; with real data and larger windows, can be larger
    
    ## create df with rolling stats, upper and lower bounds
    bounds = pd.DataFrame({'median':df['b'].rolling(window).median(),
    'std':df['b'].rolling(window).std()})
    
    bounds['upper']=bounds['median']+bounds['std']*std
    bounds['lower']=bounds['median']-bounds['std']*std
    
    ## here, we set an identifier for each window which maps to the original df
    ## the first six rows are the first window; then each additional row is a new window
    bounds['window_id']=np.append(np.zeros(window),np.arange(1,n-window+1))
    
    ## then we can assign the original 'b' value back to the bounds df
    bounds['b']=df['b']
    
    ## and finally, keep only rows where b falls within the desired bounds
    bounds.loc[bounds.eval("lower<b<upper")]
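
To make the row-to-window mapping described above concrete, here is a minimal sketch on a toy series (the series `s` and the window size of 4 are made up for illustration, not taken from the answer). With the default trailing window, the rolling value at row i is computed from rows i-window+1 through i, and the first window-1 rows have no complete window, so they come back as NaN.

```python
import numpy as np
import pandas as pd

# Toy data, assumed for illustration only
s = pd.Series(np.arange(10, dtype=float))
window = 4

# Trailing rolling median, as used in the answer above
rolled = s.rolling(window).median()

# Recompute each complete window by hand:
# the value at row i uses rows i-window+1 .. i (inclusive)
manual = [
    s.iloc[i - window + 1 : i + 1].median()
    for i in range(window - 1, len(s))
]

# The first window-1 entries are NaN because their windows are incomplete
print(rolled.head(window))
print(manual[:3])
```

Because comparisons against NaN evaluate to False, those first window-1 rows drop out of the final `bounds.loc[...]` selection as well.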
    

    【Discussion】:

      【Solution 2】:

      Just filter the dataframe directly (with `window` set to your rolling window size):

      df['median']= df['b'].rolling(window).median()
      df['std'] = df['b'].rolling(window).std()
      
      #filter setup
      df = df[(df.b <= df['median']+3*df['std']) & (df.b >= df['median']-3*df['std'])]
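
The same filter can be sketched as a small helper that returns a boolean mask instead of adding columns to df. This is only a sketch under assumptions: the name `rolling_outlier_mask`, the ramp-shaped test data, and the injected spike are all hypothetical and chosen so the behaviour is easy to check.

```python
import numpy as np
import pandas as pd

def rolling_outlier_mask(series, window, num_std=3.0):
    """Hypothetical helper: True where a value lies more than num_std
    rolling standard deviations away from the rolling median."""
    roll = series.rolling(window)
    median = roll.median()
    std = roll.std()
    # Comparisons against NaN are False, so the first window-1 rows
    # (which have no complete window) are never flagged here.
    return (series - median).abs() > num_std * std

# Made-up data: a smooth ramp with one obvious spike at row 50
df = pd.DataFrame({"b": np.arange(100, dtype=float)})
df.loc[50, "b"] = 1050.0

mask = rolling_outlier_mask(df["b"], window=15)
filtered = df[~mask]  # df itself is left unchanged
print(filtered.shape)
```

Two things to keep in mind. Unlike the one-liner above, this version keeps the first window-1 rows rather than dropping them. Also, the spike itself inflates the standard deviation of every window that contains it, so with a small window a 3-standard-deviation threshold may never trigger; that is one reason to try a larger window or a smaller multiplier on real data.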
      

      【Discussion】:

        【Solution 3】:

        Here is my take on creating a median filter:

        def median_filter(num_std=3):
            def _median_filter(x):
                _median = np.median(x)
                _std = np.std(x)
                s = x[-1]
                return s if s >= _median - num_std * _std and s <= _median + num_std * _std else np.nan
            return _median_filter
        
        df.y.rolling(window).apply(median_filter(num_std=3), raw=True)
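
One detail worth checking when comparing this approach with the `rolling(...).std()` based ones: `np.std` defaults to the population standard deviation (ddof=0), while the pandas rolling `std` uses the sample standard deviation (ddof=1), so the two thresholds differ slightly for the same window. A small sketch, with values made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy window, assumed for illustration only
values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
window = 5

# pandas rolling std: sample standard deviation (ddof=1)
pandas_std = values.rolling(window).std().iloc[-1]

# NumPy default: population standard deviation (ddof=0)
numpy_std = np.std(values.to_numpy())

# Passing ddof=1 makes NumPy agree with pandas
numpy_std_ddof1 = np.std(values.to_numpy(), ddof=1)

print(pandas_std, numpy_std, numpy_std_ddof1)
```

If the results need to match the other answers exactly, passing `ddof=1` to `np.std` inside `_median_filter` should do it. Note also that the `apply` call returns a Series with NaN in place of rejected points (and for the first window-1 rows), so a final `dropna()` is needed to actually remove them.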
        

        【Discussion】:
