【Posted】: 2021-07-03 06:51:34
【Question】:
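In my df below, I want to:
- identify and flag the outliers in col_E using z-scores
- separately, understand how to use z-scores to identify and flag outliers in two or more columns, for example col_D and col_E
See the dataset below.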
import numpy as np  # np.abs is used in the attempts below
import pandas as pd
from scipy import stats
# initialise data of lists
df = {
'col_A':['P0', 'P1', 'P2', 'P4', 'P5'],
'col_B':[1,1,1,1,1],
'col_C':[1,2,3,5,9],
'col_D':[120.05, 181.90, 10.34, 153.10, 311.17],
'col_E':[110.21, 191.12, 190.21, 12.00, 245.09 ],
'col_F':[100.22,199.10, 191.13,199.99, 255.19],
'col_G':[140.29, 291.07, 390.22, 245.09, 4122.62],
}
# Create DataFrame
df = pd.DataFrame(df)
# Print the output.
df
Desired: first flag all outliers in col_D, then in col_D and col_E (note: in my image below, 10.34 and 12.00 are highlighted arbitrarily, just to illustrate).
[image: Q1]
Attempt:
#Q1
exclude_cols = ['col_A','col_B','col_C','col_D','col_F','col_G']
include_cols = ['col_E'] # desired column
def flag_outliers(s, exclude_cols):
    if s.name in exclude_cols:
        print(s.name)
        return ''
    else:
        s=df[(np.abs(stats.zscore(df['col_E'])) > 3)] # not sure of this part of the code
        return ['background-color: yellow' if v else '' for v in indexes]
df.style.apply(lambda s: flag_outliers(s, exclude_cols), axis=1, subset=include_cols)
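For reference, a minimal sketch of what Q1 appears to be after, under some assumptions rather than as a definitive fix. It assumes the styling function should receive one column at a time (axis=0), so the z-score is computed on the Series that Styler passes in instead of on df['col_E'] directly. The helper style_outliers and the threshold parameter are illustrative names, not part of the original attempt, and the DataFrame is a trimmed copy of the dataset above.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Trimmed copy of the question's dataset (only the columns needed here)
df = pd.DataFrame({
    'col_A': ['P0', 'P1', 'P2', 'P4', 'P5'],
    'col_D': [120.05, 181.90, 10.34, 153.10, 311.17],
    'col_E': [110.21, 191.12, 190.21, 12.00, 245.09],
})

def flag_outliers(s, threshold=3.0):
    """Return one CSS string per value: yellow where |z| > threshold."""
    # z-scores of the column itself (scipy default ddof=0)
    z = np.abs(stats.zscore(s.to_numpy()))
    return ['background-color: yellow' if flag else '' for flag in z > threshold]

def style_outliers(frame, cols, threshold=3.0):
    """Hypothetical helper: apply flag_outliers column by column to `cols`."""
    # axis=0 hands each column in `subset` to flag_outliers as a Series;
    # extra keyword arguments are forwarded to the function
    return frame.style.apply(flag_outliers, threshold=threshold,
                             axis=0, subset=cols)

# The CSS strings can be inspected without rendering anything
print(flag_outliers(df['col_E']))
print(flag_outliers(df['col_E'], threshold=1.5))
```

In a notebook, style_outliers(df, ['col_E']) would render the table. With only five rows nothing exceeds a threshold of 3, so the lower threshold in the last line is there purely to make the highlighting visible.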
#Q2
exclude_cols = ['col_A','col_B','col_C','col_F','col_G']
include_cols = ['col_D','col_E'] # desired columns
def flag_outliers(s, exclude_cols):
    if s.name in exclude_cols:
        print(s.name)
        return ''
    else:
        s=df[(np.abs(stats.zscore(df['col_E'])) > 3)] # not sure of this part of the code
        return ['background-color: yellow' if v else '' for v in indexes]
df.style.apply(lambda s: flag_outliers(s, exclude_cols), axis=1, subset=include_cols)
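For Q2, one possible reading is that each column is scored against its own mean and standard deviation. The sketch below takes that reading as an assumption: it first builds a boolean mask for the chosen columns and then turns the mask into cell styles. The names outlier_mask, highlight and threshold are hypothetical, introduced only for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Trimmed copy of the question's dataset
df = pd.DataFrame({
    'col_A': ['P0', 'P1', 'P2', 'P4', 'P5'],
    'col_D': [120.05, 181.90, 10.34, 153.10, 311.17],
    'col_E': [110.21, 191.12, 190.21, 12.00, 245.09],
})

def outlier_mask(frame, cols, threshold=3.0):
    """Boolean DataFrame: True where |z| within the value's own column > threshold."""
    z = np.abs(stats.zscore(frame[cols].to_numpy(), axis=0))
    return pd.DataFrame(z > threshold, index=frame.index, columns=cols)

def highlight(frame, cols, threshold=3.0):
    """Hypothetical helper: Styler with flagged cells in `cols` shaded yellow."""
    mask = outlier_mask(frame, cols, threshold)
    css = pd.DataFrame(np.where(mask, 'background-color: yellow', ''),
                       index=mask.index, columns=mask.columns)
    # axis=None: the function must return a DataFrame of CSS strings
    # with the same index and columns as the styled subset
    return frame.style.apply(lambda _: css, axis=None, subset=cols)

# Threshold 1.4 is illustrative only; the conventional 3 flags nothing here
print(outlier_mask(df, ['col_D', 'col_E'], threshold=1.4))
```

The mask is also usable on its own, for example to add flag columns to df instead of (or in addition to) colouring cells.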
Thanks!
【Comments】:
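- stats.zscore(df['col_D']) and stats.zscore(df['col_E']) show that there are no outliers with |z| > 3!
- Also, what do you mean by "z-scores in two or more columns"? Are you pooling the values into a single sample?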
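Both comments can be checked with a few lines. The sketch below is illustrative and assumes scipy's default ddof=0. Under that default, no |z| in a sample of n values can exceed sqrt(n - 1), which is 2 for five rows, so a threshold of 3 can never fire on this dataset. It also contrasts per-column z-scores with the "pooled" reading from the second comment, where the values of both columns are treated as one sample. The variable names are made up for the example.

```python
import numpy as np
import pandas as pd
from scipy import stats

# The two columns under discussion, copied from the question
df = pd.DataFrame({
    'col_D': [120.05, 181.90, 10.34, 153.10, 311.17],
    'col_E': [110.21, 191.12, 190.21, 12.00, 245.09],
})
values = df.to_numpy()

# Per-column z-scores: each column uses its own mean and std (ddof=0)
z_per_column = pd.DataFrame(stats.zscore(values, axis=0),
                            index=df.index, columns=df.columns)
max_abs_z = z_per_column.abs().max()

# Upper limit on |z| for n values when ddof=0
bound = np.sqrt(len(df) - 1)

# Pooled z-scores: flatten both columns into one sample, then restore the shape
z_pooled = pd.DataFrame(stats.zscore(values.ravel()).reshape(values.shape),
                        index=df.index, columns=df.columns)

print(max_abs_z)   # largest |z| in each column
print(bound)       # limit for five rows
print(z_pooled)
```

Which of the two readings is appropriate depends on whether col_D and col_E measure the same quantity; the question does not say.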
Tags: python pandas dataframe scipy statsmodels