这是在每一列上使用 IQR 的一种方法。
我发现想出一个简单的“动词”函数通常非常有用,该函数可以轻松测试和演示,然后根据需要多次应用。
is_outlier <- function(x, iqrfac = 3) {
quants <- quantile(x, na.rm = TRUE)
iqr <- quants[4] - quants[2]
!is.na(x) & (x < (quants[2] - iqrfac*iqr) | (quants[4] + iqrfac*iqr) < x)
}
查看您上面的示例数据,我更改了 $hippocamp 中的一个值,以获得更多异常值(除了 $amygdala 中的最后一个值)...
dat <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
id cuneus hippocamp icv amygdala putamen
1 5.1 3.5 1.4 0.2 5
2 4.9 3.0 1.4 0.2 4
3 4.7 13.2 1.3 0.2 10
4 4.6 3.1 1.5 0.2 1
5 5.0 3.6 1.4 0.2 4
6 5.4 3.9 1.7 0.4 8")
lapply(dat, is_outlier)
# $id
# [1] FALSE FALSE FALSE FALSE FALSE FALSE
# $cuneus
# [1] FALSE FALSE FALSE FALSE FALSE FALSE
# $hippocamp
# [1] FALSE FALSE TRUE FALSE FALSE FALSE
# $icv
# [1] FALSE FALSE FALSE FALSE FALSE FALSE
# $amygdala
# [1] FALSE FALSE FALSE FALSE FALSE TRUE
# $putamen
# [1] FALSE FALSE FALSE FALSE FALSE FALSE
从那里,我们可以使用lapply 的组合来返回逻辑向量列表(其中TRUE 表示该向量中的值需要替换)、replace(进行替换)和Map(将replace 函数映射到每一列和每个逻辑向量)。这可能看起来很复杂,但是...
dat[] <- Map(replace, dat, lapply(dat, is_outlier), NA)
dat
# id cuneus hippocamp icv amygdala putamen
# 1 1 5.1 3.5 1.4 0.2 5
# 2 2 4.9 3.0 1.4 0.2 4
# 3 3 4.7 NA 1.3 0.2 10
# 4 4 4.6 3.1 1.5 0.2 1
# 5 5 5.0 3.6 1.4 0.2 4
# 6 6 5.4 3.9 1.7 NA 8
Map 实际上类似于lapply。
lapply(mydat, myfunc)
# "unrolls" to
list(
myfunc(mydat[[1]]),
myfunc(mydat[[2]]),
myfunc(mydat[[3]]),
...,
myfunc(mydat[[n]])
)
# equivalently
Map(myfunc, mydat) # reversed arguments
# "unrolls" to
list(
myfunc(mydat[[1]]),
myfunc(mydat[[2]]),
myfunc(mydat[[3]]),
...,
myfunc(mydat[[n]])
)
# extended
Map(otherfunc, datA, datB, datC)
# "unrolls" to
list(
otherfunc(datA[[1]], datB[[1]], datC[[1]]),
otherfunc(datA[[2]], datB[[2]], datC[[2]]),
otherfunc(datA[[3]], datB[[3]], datC[[3]]),
...,
otherfunc(datA[[n]], datB[[n]], datC[[n]])
)
# assuming that datA, datB, and datC are all the same length
Darren 使用 lowerq 和 upperq 代替 data 的评论是正确的。请允许我向您展示为什么会出现这种情况。
set.seed(42)
data <- rnorm(20)
data
# [1] 1.37095845 -0.56469817 0.36312841 0.63286260 0.40426832 -0.10612452 1.51152200
# [8] -0.09465904 2.01842371 -0.06271410 1.30486965 2.28664539 -1.38886070 -0.27878877
# [15] -0.13332134 0.63595040 -0.28425292 -2.65645542 -2.44046693 1.32011335
quant <- quantile(data, na.rm = TRUE)
lowerq <- quant[2]
upperq <- quant[4]
iqr <- upperq - lowerq
cbind(data, lower=data-(iqr * 3), upper=data+(iqr * 3))
# data lower upper
# [1,] 1.37095845 -3.395548 6.137465
# [2,] -0.56469817 -5.331204 4.201808
# [3,] 0.36312841 -4.403378 5.129635
# [4,] 0.63286260 -4.133644 5.399369
# [5,] 0.40426832 -4.362238 5.170774
# [6,] -0.10612452 -4.872631 4.660382
# [7,] 1.51152200 -3.254984 6.278028
# [8,] -0.09465904 -4.861165 4.671847
# [9,] 2.01842371 -2.748082 6.784930
# [10,] -0.06271410 -4.829220 4.703792
# [11,] 1.30486965 -3.461636 6.071376
# [12,] 2.28664539 -2.479861 7.053152
# [13,] -1.38886070 -6.155367 3.377645
# [14,] -0.27878877 -5.045295 4.487717
# [15,] -0.13332134 -4.899827 4.633185
# [16,] 0.63595040 -4.130556 5.402457
# [17,] -0.28425292 -5.050759 4.482253
# [18,] -2.65645542 -7.422962 2.110051
# [19,] -2.44046693 -7.206973 2.326039
# [20,] 1.32011335 -3.446393 6.086619
从data 中减去iqr*3 就是从data 中的每个值中减去它。这意味着所有_threshold 值将始终位于iqr*3 下方和iqr*3 上方每个data。这相当于说:
data > (data - (iqr*3)) | data < (data - (iqr*3))
这是总是正确的。
相反,
cbind(data, lower = lowerq-(iqr * 3), upper = upperq+(iqr * 3))
# data lower upper
# [1,] 1.37095845 -5.046661 6.075187
# [2,] -0.56469817 -5.046661 6.075187
# [3,] 0.36312841 -5.046661 6.075187
# [4,] 0.63286260 -5.046661 6.075187
# [5,] 0.40426832 -5.046661 6.075187
# [6,] -0.10612452 -5.046661 6.075187
# [7,] 1.51152200 -5.046661 6.075187
# [8,] -0.09465904 -5.046661 6.075187
# [9,] 2.01842371 -5.046661 6.075187
# [10,] -0.06271410 -5.046661 6.075187
# [11,] 1.30486965 -5.046661 6.075187
# [12,] 2.28664539 -5.046661 6.075187
# [13,] -1.38886070 -5.046661 6.075187
# [14,] -0.27878877 -5.046661 6.075187
# [15,] -0.13332134 -5.046661 6.075187
# [16,] 0.63595040 -5.046661 6.075187
# [17,] -0.28425292 -5.046661 6.075187
# [18,] -2.65645542 -5.046661 6.075187
# [19,] -2.44046693 -5.046661 6.075187
# [20,] 1.32011335 -5.046661 6.075187
(在此示例中始终如此,但至少您可以看到比较是针对每个较低/较高的单个值进行的。)