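We can create a function that selects only the numeric columns (select_if), loops through those columns (map), and subsets each one to the elements that are not outliers. The output is a list of vectors.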
library(dplyr)
library(tidyr)
library(purrr)
outlierremoval <- function(dataframe) {
  dataframe %>%
    select_if(is.numeric) %>%  # keep only the numeric columns
    map(~ .x[!.x %in% boxplot.stats(.x)$out])  # %>%
    # It is not clear whether the output should be a list or a data.frame.
    # For a data.frame, the columns could end up with different lengths,
    # so we may pad them with cbind.fill:
    # { do.call(rowr::cbind.fill, c(., list(fill = NA))) }
}
outlierremoval(Clean_Data)
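To make the rule behind boxplot.stats concrete, here is a minimal base R sketch on a made-up vector (the values are purely illustrative and are not taken from Clean_Data). With the default coef = 1.5, a value is flagged when it lies more than 1.5 times the hinge spread beyond the lower or upper hinge.

```r
# Toy vector, invented for illustration only
x <- c(10, 12, 11, 13, 12, 11, 100)

# Values that boxplot.stats() flags as outliers (default coef = 1.5)
flagged <- boxplot.stats(x)$out

# Same subsetting expression that is used inside map() above
kept <- x[!x %in% flagged]

print(flagged)
print(kept)
```

Here only 100 falls outside the fences, so `kept` holds the remaining six values.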
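If we want to keep all the other columns as well, use map_if and build a data.frame output with cbind.fill, which pads the shorter columns with NA at the end. Note, however, that this shifts the row positions within each column, depending on how many outliers were removed from it.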
outlierremoval <- function(dataframe) {
  dataframe %>%
    map_if(is.numeric, ~ .x[!.x %in% boxplot.stats(.x)$out]) %>%
    { do.call(rowr::cbind.fill, c(., list(fill = NA))) } %>%
    set_names(names(dataframe))
}
res <- outlierremoval(Clean_Data)
head(res)
#  X Observation Dist_Taxi Dist_Market Dist_Hospital Carpet Builtup      Parking City_Category Rainfall House_Price
#1 1           1      9796        5250         10703   1659    1961         Open         CAT B      530     6649000
#2 2           2      8294        8186         12694   1461    1752 Not Provided         CAT B      210     3982000
#3 3           3     11001       14399         16991   1340    1609 Not Provided         CAT A      720     5401000
#4 4           4      8301       11188         12289   1451    1748      Covered         CAT B      620     5373000
#5 5           5     10510       12629         13921   1770    2111 Not Provided         CAT B      450     4662000
#6 6           6      6665        5142          9972   1442    1733         Open         CAT B      760     4526000
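In case rowr is not available, the padding step could probably be done in base R instead. The sketch below is an assumption-laden stand-in, not the rowr implementation: `pad_to_data_frame` is a hypothetical helper name, and it relies on `length<-` filling the extended positions with NA.

```r
# Hypothetical helper (not part of rowr or purrr): pad every vector in a
# named list with trailing NA up to the longest length, then combine them.
pad_to_data_frame <- function(cols) {
  n <- max(lengths(cols))
  padded <- lapply(cols, function(v) {
    length(v) <- n  # extending a vector fills the new slots with NA
    v
  })
  as.data.frame(padded, stringsAsFactors = FALSE)
}

# Toy input with columns of unequal length (invented for illustration)
cols <- list(a = c(1, 2, 3), b = c("x", "y"))
padded_df <- pad_to_data_frame(cols)
print(padded_df)
```

As with cbind.fill, the NA values land at the end of each column, so the same row-shifting caveat applies.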
Update
If we need the outliers themselves, we extract the `out` element from boxplot.stats in the map step.
outliers <- function(dataframe) {
  dataframe %>%
    select_if(is.numeric) %>%
    map(~ boxplot.stats(.x)$out)
}
outliers(Clean_Data)
Or replace the outliers with NA (this also preserves the row positions).
outlierreplacement <- function(dataframe) {
  dataframe %>%
    map_if(is.numeric, ~ replace(.x, .x %in% boxplot.stats(.x)$out, NA)) %>%
    bind_cols
}
outlierreplacement(Clean_Data)
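The difference between dropping outliers and replacing them can be seen on a small made-up data frame (the `toy` data and its column names are invented for illustration). This base R sketch compares the two approaches on a single column: dropping and then padding moves the later values up by one row, while replace() leaves every value next to its original `id`.

```r
# Toy data, invented for illustration; the outlier sits in row 2
toy <- data.frame(id = 1:7, value = c(10, 100, 12, 11, 13, 12, 11))
is_out <- toy$value %in% boxplot.stats(toy$value)$out

# Approach 1: drop the outlier, then pad with NA at the end
shifted <- toy$value[!is_out]
length(shifted) <- nrow(toy)

# Approach 2: replace the outlier with NA in place
aligned <- replace(toy$value, is_out, NA)

print(data.frame(id = toy$id, shifted = shifted, aligned = aligned))
```

In `shifted` the NA ends up in the last row and rows 2 to 6 no longer match their `id`; in `aligned` the NA stays in row 2.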