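【Posted】: 2023-01-30 23:59:20
【Question】:
I am running an IPR outlier check on a relatively large DataFrame, df. The check is applied to subsets of the data (one per product and brick), so I use a for loop.
How can I write the results back to the original df, which has more than 1,000,000 rows?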
   months product  brick  units  is_outlier
0  202104     abc      3   1.00       False
1  202104     abc      6   3.00       False
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

for product in df['product'].unique():
    for brick in df['brick'].unique():
        # Select the rows for the current product and brick
        mask = (df['product'] == product) & (df['brick'] == brick)
        data = df.loc[mask, 'units'].to_numpy(dtype=float)
        # Skip product/brick combinations that have no rows
        if data.size == 0:
            continue
        # Scale the data
        scaler = StandardScaler()
        data_scaled = scaler.fit_transform(data.reshape(-1, 1))
        # Fit a linear regression model to the data
        x = np.arange(len(data_scaled)).reshape(-1, 1)
        reg = LinearRegression()
        reg.fit(x, data_scaled)
        # Calculate the residuals of the regression
        residuals = (data_scaled - reg.predict(x)).ravel()
        # Identify any observations with a residual larger than 2 standard deviations
        threshold = 2 * residuals.std()
        outlier_pos = np.where(np.abs(residuals) > threshold)[0]
        # outlier_pos holds positions within the subset, not labels of df,
        # so map them back to df's index before setting "is_outlier"
        # (this assumes df has a unique index)
        outlier_labels = df.index[mask.to_numpy()][outlier_pos]
        df.loc[outlier_labels, 'is_outlier'] = True
【Comments】:
- for brick in df['brick'].unique(): sounds like a job for groupby.
- I updated my question
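
The groupby suggestion in the comments could look roughly like the sketch below. It is not the asker's code: flag_trend_outliers and its n_std parameter are hypothetical names, the sample data is made up for illustration, and np.polyfit is swapped in for the StandardScaler plus LinearRegression pair. Because the threshold is a multiple of the residuals' own standard deviation, rescaling the units should not change which rows get flagged, but treat that as an assumption to verify on real data. transform returns a Series aligned to the index of df, so the flags go straight back into the original frame without a lookup per product and brick.

```python
import numpy as np
import pandas as pd


def flag_trend_outliers(units: pd.Series, n_std: float = 2.0) -> pd.Series:
    """Flag values whose residual from a straight-line fit is larger
    than n_std standard deviations of the residuals (hypothetical helper)."""
    y = units.to_numpy(dtype=float)
    if len(y) < 3:
        # Too few points to estimate a trend, so flag nothing
        return pd.Series(False, index=units.index)
    x = np.arange(len(y))
    # Ordinary least-squares line through (position, units)
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    flags = np.abs(residuals) > n_std * residuals.std()
    # Keep the group's original index so the result aligns with df
    return pd.Series(flags, index=units.index)


# Made-up sample data: one product, two bricks, one spike in each brick
df = pd.DataFrame({
    "months": [202101 + i for i in range(10)] * 2,
    "product": ["abc"] * 20,
    "brick": [3] * 10 + [6] * 10,
    "units": [1, 2, 3, 4, 50, 6, 7, 8, 9, 10,
              10, 12, 11, 13, 12, 14, 13, 100, 15, 14],
})

# One pass over the groups; the result is aligned to df's index
df["is_outlier"] = (
    df.groupby(["product", "brick"])["units"]
      .transform(flag_trend_outliers)
      .astype(bool)
)

print(df[df["is_outlier"]])
```

On a frame with more than a million rows this visits only the product and brick combinations that actually occur, instead of scanning the whole frame once per pair as the nested loops do.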