【Question title】: Cleaning spikes in time series data using neighbouring data points
【Posted】: 2021-12-06 02:05:25
【Question】:

I am trying to clean spikes out of time series data stored in a Pandas DataFrame.

value = 5000
for index, row in gauteng_df.iterrows():
    if index == gauteng_df.shape[0]-1:
        break
    upper, lower = row['Admissions to Date'] + value, row['Admissions to Date'] - value
    a = gauteng_df.iloc[index+1]['Admissions to Date']
    if a > upper or a < lower:
        a = (gauteng_df.iloc[index-1]['Admissions to Date'] + gauteng_df.iloc[index+1]['Admissions to Date'])/2
        gauteng_df.iloc[index]['Admissions to Date'] = a

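I tried to reference the following data point. If the current data point falls outside the interval around the following data point (i.e. point ± value), the current data point is replaced with the mean of the previous and the next data points. Unfortunately, when I plot the result, no change is reflected and the spikes are still there.

I'd appreciate any help! Also, df.iterrows() is probably not the most efficient approach, so I'd be grateful for a better way to replace the spike values.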

【Question comments】:

    Tags: python pandas dataframe


    【Solution 1】:

    Here is an alternative that avoids iterating over the DataFrame values: scipy.signal.find_peaks

    import pandas as pd
    import numpy as np
    from scipy.signal import find_peaks
    
    # Example data with a peak and a valley
    gauteng_df = pd.DataFrame({'Admissions to Date':
                               [8000, 4500, 12000, 5500, 
                                3000, 7500,  1000, 8500]
    })
    
    # Peak detection threshold
    value = 5000
    
    # `prominence` sets minimum height above surrounding 
    # signal at which a given value is considered a peak
    peak_idx = find_peaks(gauteng_df['Admissions to Date'], prominence=value)[0]
    
    # To detect valleys deeper than `value`, 
    # run find_peaks on negative of data
    valley_idx = find_peaks(-gauteng_df['Admissions to Date'], prominence=value)[0]
    
    # Combine indexes of peaks and valleys into a single array
    idx = np.concatenate((peak_idx, valley_idx))
    
    # Build an indicator column of peaks and valleys, or outliers
    gauteng_df['outlier'] = False
    gauteng_df.loc[idx, 'outlier'] = True
    
    # Replace each outlier value with NaN
    gauteng_df.loc[gauteng_df['outlier'], 'Admissions to Date'] = np.nan
    
    # Interpolate over NaNs just created with default linear method
    gauteng_df['Interpolated'] = (gauteng_df['Admissions to Date']
                                 .interpolate()
                                 .astype(int))
    
    # Result
    print(gauteng_df)
    
       Admissions to Date  outlier  Interpolated
    0              8000.0    False          8000
    1              4500.0    False          4500
    2                 NaN     True          5000
    3              5500.0    False          5500
    4              3000.0    False          3000
    5              7500.0    False          7500
    6                 NaN     True          8000
    7              8500.0    False          8500
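The steps above can be wrapped in a small helper. This is only a sketch: the function name `interpolate_outliers` and the date index in the example are assumptions made for illustration. It assigns by position with `.iloc`, because `find_peaks` returns positional indices, and those only match `.loc` labels when the frame has the default integer index. Time series often carry a date index instead.

```python
import numpy as np
import pandas as pd
from scipy.signal import find_peaks


def interpolate_outliers(series: pd.Series, prominence: float) -> pd.Series:
    """Replace prominent peaks and valleys with linearly interpolated values.

    Hypothetical helper, not part of pandas or SciPy.
    """
    values = series.to_numpy(dtype=float)

    # Positions of peaks, and of valleys (peaks of the negated signal)
    peak_pos, _ = find_peaks(values, prominence=prominence)
    valley_pos, _ = find_peaks(-values, prominence=prominence)
    outlier_pos = np.concatenate((peak_pos, valley_pos))

    # Blank the outliers by position, then fill the gaps linearly
    cleaned = series.astype(float).copy()
    cleaned.iloc[outlier_pos] = np.nan
    return cleaned.interpolate()


# Same sample values as above, but with an assumed daily date index
admissions = pd.Series(
    [8000, 4500, 12000, 5500, 3000, 7500, 1000, 8500],
    index=pd.date_range("2021-01-01", periods=8, freq="D"),
)
cleaned = interpolate_outliers(admissions, prominence=5000)
```

`find_peaks` never reports the first or last sample as a peak, so the edges are left untouched and the interpolation always has a neighbour on both sides.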
    

    【Comments】:

      【Solution 2】:

      Assuming your DataFrame is sorted by time, create a new column holding the previous row's value...

      df['Previous_admissions_value'] = df['Admissions to Date'].shift(1, fill_value=0)
      

      ...and another new column holding the next row's value:

      df['Next_admissions_value'] = df['Admissions to Date'].shift(-1, fill_value=0)
      

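      Because the first row has no previous value and the last row has no next value, the code above fills those two cells with 0. If that does not suit your data, you can overwrite them manually with whatever values you need.

      Then check the condition and apply the update: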
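As a sketch of one alternative to the 0 fill, the edge rows can borrow their own value as the missing neighbour. This is an assumption about what is acceptable for the data, not something stated in the answer. The column names follow the answer, and the four sample values are made up for illustration.

```python
import pandas as pd

# Small made-up sample, for illustration only
df = pd.DataFrame({'Admissions to Date': [8000, 4500, 12000, 5500]})
col = df['Admissions to Date']

# shift() without fill_value leaves NaN at the edge;
# fillna(col) then substitutes the row's own value there
df['Previous_admissions_value'] = col.shift(1).fillna(col)
df['Next_admissions_value'] = col.shift(-1).fillna(col)
```

With this fill, the difference between the last row and its substituted "next" value is 0, so padding alone never flags the last row. With a fill of 0, any last value larger than `value` would be flagged.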

      import numpy as np
      
      # Flag rows that differ from the next row by more than `value`
      df['update_condition'] = np.where(abs(df['Admissions to Date'] - df['Next_admissions_value']) > value, 1, 0)
      
      # Replace flagged rows with the mean of the previous and next values
      df['Admissions to Date'] = np.where(df['update_condition'] > 0,
                                          (df['Next_admissions_value'] + df['Previous_admissions_value']) / 2.0,
                                          df['Admissions to Date'])
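One caveat: comparing a row only with the next row also flags the row just before a spike. The sketch below is a stricter variant and not this answer's method. It flags a row only when it deviates from both neighbours by more than `value` in the same direction. The function name `replace_spikes` is hypothetical, and the sample values are borrowed from Solution 1.

```python
import pandas as pd


def replace_spikes(series: pd.Series, value: float) -> pd.Series:
    """Replace isolated spikes with the mean of their two neighbours.

    Hypothetical helper: a row counts as a spike only if it lies more than
    `value` above both neighbours, or more than `value` below both.
    """
    series = series.astype(float)
    prev_val = series.shift(1)    # NaN for the first row
    next_val = series.shift(-1)   # NaN for the last row

    diff_prev = series - prev_val
    diff_next = series - next_val

    # Comparisons against NaN are False, so edge rows are never flagged
    spike_up = (diff_prev > value) & (diff_next > value)
    spike_down = (diff_prev < -value) & (diff_next < -value)
    is_spike = spike_up | spike_down

    # Keep non-spike rows, replace spikes with the neighbour mean
    return series.where(~is_spike, (prev_val + next_val) / 2.0)


df = pd.DataFrame({'Admissions to Date':
                   [8000, 4500, 12000, 5500, 3000, 7500, 1000, 8500]})
df['Admissions to Date'] = replace_spikes(df['Admissions to Date'], value=5000)
```

Assigning the returned Series back to the column writes to the frame directly. By contrast, `gauteng_df.iloc[index]['Admissions to Date'] = a` in the question is chained indexing, which typically assigns into a temporary copy of the row. That is a likely reason the plot did not change.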
      

      【Comments】:
