【Question title】: How to detect outliers in the columns of a data frame in R?
【Posted】: 2013-04-11 23:00:25
【Question】:

I have a data frame, say like this:

names<-c("a","a","a","a","a","b","b","b","b","b","c","c","c","c","c","c","c","c")
var1<-c(0.942999593,0.935507266,0.973589623,0.969415912,0.95230801,0.935507266,0.888740961,0.91750551,0.944482672,0.945468585,1.457579147,0.922206277,0.941511433,0.954724791,0.941014244,0.941511433,0.941511433,1.50511433)
var2<-c(-0.012678088,0.014313763,0.001138275,-0.020568206,0.012987126,0.001217192,0.03360358,0.009758172,0.015066932,-0.037879492,0.020471157,0.010738162,0.010952531,0.019377213,0.027140572,0.031116892,-0.018530676,-8.90E-05)
# data.frame() keeps var1 and var2 numeric; cbind() would coerce every column to character
df <- data.frame(names, var1, var2)

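I want to convert the outliers in the var1 and var2 columns to NA. However, I want the outliers to be computed independently for each category in the "names" column. So the outliers for "a" in var1 would be those found using only the first 5 rows of var1.

I define outliers as all values below the 0.25 quantile or above the 0.75 quantile.

Is there a simple way to do this in R?

Thank you very much.

Tina.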

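【Comments】:

  • You could use split() on your df by the variable names and detect your outliers (however you define them).
  • @JuliánUrbano: How could it not be? She is asking for quantiles, not absolute values.
  • @CarlWitthoft Right.. I didn't read it properly ;)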
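The split() suggestion in the first comment could look roughly like the following. This is only a sketch using the question's names and var1 vectors, with the 0.25/0.75 quantile rule from the question; the variable names groups, cleaned and var1_clean are made up for illustration.

```r
names <- c(rep("a", 5), rep("b", 5), rep("c", 8))
var1 <- c(0.942999593, 0.935507266, 0.973589623, 0.969415912, 0.95230801,
          0.935507266, 0.888740961, 0.91750551, 0.944482672, 0.945468585,
          1.457579147, 0.922206277, 0.941511433, 0.954724791, 0.941014244,
          0.941511433, 0.941511433, 1.50511433)

# Break var1 into one numeric vector per level of names.
groups <- split(var1, names)

# Within each group, blank out values outside that group's 25%-75% range.
cleaned <- lapply(groups, function(x) {
  q <- quantile(x, c(0.25, 0.75))
  x[x < q[[1]] | x > q[[2]]] <- NA
  x
})

# unsplit() reassembles the pieces in the original row order.
var1_clean <- unsplit(cleaned, factor(names))
```

Because unsplit() restores the original ordering, var1_clean can be assigned straight back into a column of the data frame.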

Tags: r dataframe outliers


【Solution 1】:

Here is how to do it for var1:

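# Quantiles of var1 within each level of names
quantiles <- tapply(var1, names, quantile)
# Per-row 25% and 75% bounds, looked up by each row's group
minq <- sapply(names, function(x) quantiles[[x]]["25%"])
maxq <- sapply(names, function(x) quantiles[[x]]["75%"])
# Values outside their group's bounds become NA
var1[var1 < minq | var1 > maxq] <- NA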

Repeat the same steps for var2 (or df$var2).
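To avoid repeating the steps by hand for every column, the same rule can be wrapped in a small function and applied per group with ave(), which evaluates a function within groups and returns a vector aligned with its input. This is a sketch rather than part of the original answer: it assumes df holds numeric var1 and var2 columns, and mask_outside_quartiles is a hypothetical helper name, not a base R function.

```r
names <- c(rep("a", 5), rep("b", 5), rep("c", 8))
var1 <- c(0.942999593, 0.935507266, 0.973589623, 0.969415912, 0.95230801,
          0.935507266, 0.888740961, 0.91750551, 0.944482672, 0.945468585,
          1.457579147, 0.922206277, 0.941511433, 0.954724791, 0.941014244,
          0.941511433, 0.941511433, 1.50511433)
var2 <- c(-0.012678088, 0.014313763, 0.001138275, -0.020568206, 0.012987126,
          0.001217192, 0.03360358, 0.009758172, 0.015066932, -0.037879492,
          0.020471157, 0.010738162, 0.010952531, 0.019377213, 0.027140572,
          0.031116892, -0.018530676, -8.90E-05)
df <- data.frame(names, var1, var2)

# Hypothetical helper: set values below the 25% quantile or above the
# 75% quantile of x to NA.
mask_outside_quartiles <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  x[x < q[[1]] | x > q[[2]]] <- NA
  x
}

# ave() runs the helper separately within each level of df$names.
for (col in c("var1", "var2")) {
  df[[col]] <- ave(df[[col]], df$names, FUN = mask_outside_quartiles)
}
```

With this rule roughly half of each group ends up as NA by construction, so a wider band (for example 1.5 times the interquartile range beyond the quartiles) may be worth considering if that is too aggressive.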

【Comments】:

  • That is what I was going to say, but this is more concise (note the handy way minq is matched to the levels).