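【Question title】: Return value to df after several operations
【Posted】: 2023-01-30 23:59:20
【Question description】:

I run IPR outlier control on a relatively large DataFrame df. I perform the IPR on subsets of the data, so I use a for loop.

How can I write the resulting values back to the original df, which has more than 1,000,000 rows?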

        months product  brick  units  is_outlier
0       202104  abc      3   1.00       False
1       202104  abc      6   3.00       False

for product in df['product'].unique():
    for brick in df['brick'].unique():
        try:
            # Extract the units for the current product and brick
            data = df.loc[(df['product'] == product) & (df['brick'] == brick)]['units'].values

            # Scale the data
            scaler = StandardScaler()
            data_scaled = scaler.fit_transform(data.reshape(-1, 1))

            # Fit a linear regression model to the data
            reg = LinearRegression()
            reg.fit(np.arange(len(data_scaled)).reshape(-1, 1), data_scaled)

            # Calculate the residuals of the regression
            residuals = data_scaled - reg.predict(np.arange(len(data_scaled)).reshape(-1, 1))

            # Identify any observations with a residual larger than 2 standard deviations from the mean
            threshold = 2 * residuals.std()
            outliers = np.where(np.abs(residuals) > threshold)

            # Set the "is_outlier" column to True for the outliers in the current product
            df.loc[(df['product'] == product) & (df['brick'] == brick) & (df.index.isin(outliers[0])), 'is_outlier'] = True
        except:
            pass

【Question comments】:

  • for brick in df['brick'].unique(): sounds like a job for groupby.
  • I updated my question.
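
To illustrate the groupby comment: nested loops over unique() visit every product/brick combination, including pairs that never occur together (presumably why the try/except was needed), while groupby yields only the pairs present in the data, each with its original row labels. A small sketch on made-up toy data:

```python
import pandas as pd

# Made-up toy frame: only 3 of the 2 x 3 possible (product, brick) pairs occur
df = pd.DataFrame({
    "product": ["abc", "abc", "abc", "xyz"],
    "brick": [3, 6, 6, 9],
    "units": [1.0, 3.0, 2.0, 5.0],
})

# Nested loops over unique() would visit every combination, including empty ones
n_loop = df["product"].nunique() * df["brick"].nunique()

# groupby yields only the pairs that exist, together with their row labels
groups = {key: list(group.index) for key, group in df.groupby(["product", "brick"])}

print(n_loop)       # 6
print(len(groups))  # 3
```

On a frame with many products and bricks, skipping the empty combinations can save a large share of the iterations.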

Tags: python pandas outliers


【Solution 1】:

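As @QuangHoang suggested, use groupby with your custom function. The function below returns a Boolean mask the same length as its group, so transform can align the result with the original rows:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

def outlier(data):
    # Scale the data (StandardScaler expects a 2-D array)
    scaler = StandardScaler()
    data_scaled = scaler.fit_transform(data.to_numpy().reshape(-1, 1))

    # Fit a linear regression model to the data
    x = np.arange(len(data_scaled)).reshape(-1, 1)
    reg = LinearRegression()
    reg.fit(x, data_scaled)

    # Calculate the residuals of the regression
    residuals = (data_scaled - reg.predict(x)).ravel()

    # Flag observations whose residual exceeds
    # 2 standard deviations of the residuals
    threshold = 2 * residuals.std()
    return np.abs(residuals) > threshold


# transform keeps the original index, so the mask lines up row by row
df['is_outlier'] = df.groupby(['product', 'brick'])['units'].transform(outlier).astype(bool)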
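
If an explicit loop is preferred, keep in mind that np.where returns positions within each subset, not labels of df.index, so the positions have to be mapped back through the group's own index before assigning. The sketch below shows one way to do that on made-up toy data. The helper name flag_positions is hypothetical, and a plain np.polyfit line fit stands in for the StandardScaler/LinearRegression steps to keep the example short; that substitution is an assumption, not the code from the question.

```python
import numpy as np
import pandas as pd


def flag_positions(values):
    """Hypothetical helper: positions whose residual from a straight-line
    fit exceeds 2 standard deviations of the residuals."""
    if len(values) < 3:
        # Too few points for a meaningful fit; flag nothing
        return np.array([], dtype=int)
    x = np.arange(len(values))
    slope, intercept = np.polyfit(x, values, 1)
    residuals = values - (slope * x + intercept)
    return np.where(np.abs(residuals) > 2 * residuals.std())[0]


# Toy data (made up): one obvious spike in the ("abc", 3) group
df = pd.DataFrame({
    "product": ["abc"] * 8 + ["xyz"] * 8,
    "brick": [3] * 8 + [6] * 8,
    "units": [1.0, 1.1, 0.9, 1.0, 50.0, 1.0, 1.1, 0.9]
             + [2.0, 2.2, 1.9, 2.1, 2.0, 2.3, 1.8, 2.1],
})
df["is_outlier"] = False

for _, group in df.groupby(["product", "brick"]):
    positions = flag_positions(group["units"].to_numpy())
    # Translate positions within the group into labels of df.index
    df.loc[group.index[positions], "is_outlier"] = True

print(df.index[df["is_outlier"]].tolist())  # [4]
```

Indexing group.index with the positions is the step that ties each flagged observation back to its row in the full frame.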

【Comments】:
