条件非等连接答案

【问题标题】：Conditional non-equi join条件非等连接
【发布时间】：2018-03-10 20:36:28
【问题描述】：

library(tidyverse)

iris <- iris

means <- iris %>% 
  group_by(Species) %>% 
  summarise_all(funs(mean))

sd <- iris %>% 
  group_by(Species) %>% 
  summarise_all(funs(sd))

bottom <- means[ ,2:5] - sd[ ,2:5]
bottom$Species <- c("setosa", "versicolor", "virginica")
print(bottom)
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
 1     4.653510    3.048936     1.288336   0.1406144     setosa
 2     5.419829    2.456202     3.790089   1.1282473 versicolor
 3     5.952120    2.651503     5.000105   1.7513499  virginica

top <- means[ ,2:5] + sd[ ,2:5]
top$Species <- c("setosa", "versicolor", "virginica")
print(top)
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
 1     5.358490    3.807064     1.635664   0.3513856     setosa
 2     6.452171    3.083798     4.729911   1.5237527 versicolor
 3     7.223880    3.296497     6.103895   2.3006501  virginica

如何获取鸢尾花的行，其中 Sepal.Length、Sepal.Width、Petal.Length 和 Petal.Width 的值都位于顶部和底部数据帧中的值之间？

例如，我只想要 Sepal.Length > 4.65 & Sepal.Length

谢谢。

【问题讨论】：

标签： r dplyr data.table

【解决方案1】：

如果您可以在没有汇总步骤的情况下从头开始过滤，那会容易得多：

iris %>% 
    group_by(Species) %>% 
    filter_if(is.numeric, all_vars(. < mean(.) + sd(.) & . > mean(.) - sd(.)))

# A tibble: 54 x 5
# Groups:   Species [3]
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#          <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.7         3.2          1.3         0.2  setosa
# 3          5.0         3.6          1.4         0.2  setosa
# 4          5.0         3.4          1.5         0.2  setosa
# 5          4.8         3.4          1.6         0.2  setosa
# 6          5.1         3.5          1.4         0.3  setosa
# 7          5.1         3.8          1.5         0.3  setosa
# 8          5.2         3.5          1.5         0.2  setosa
# 9          5.2         3.4          1.4         0.2  setosa
#10          4.7         3.2          1.6         0.2  setosa
# ... with 44 more rows

_{不确定是否可以避免汇总步骤，请在此处作为选项发布。}

或者使用between:

iris %>% 
    group_by(Species) %>% 
    filter_if(is.numeric, all_vars(between(., mean(.) - sd(.), mean(.) + sd(.))))

【讨论】：

【解决方案2】：

这是一个使用 non-equi joins 的解决方案，它基于 (now deleted) approach of @Frank:

library(data.table)

# add a row number column and to reshape from wide to long
DT <- melt(data.table(iris)[, rn := .I], id = c("rn", "Species"))

# compute lower and upper bound for each variable and Species
mDT <- DT[, .(lb = lb <- mean(value) - (s <- sd(value)),
              ub = lb + 2 * s), by = .(Species, variable)]

# find row numbers of items which fulfill conditions
selected_rn <-
  # non-equi join 
  DT[DT[mDT, on = .(Species, variable, value > lb, value < ub), which = TRUE]][
    # all uniqueN(mDT$variable) variables must have been selected 
    # for an item to pass (thanks to @Frank for tip to avoid hardcoded value)
    , .N, by = rn][N == uniqueN(mDT$variable), rn]

head(iris[sort(selected_rn),])

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
8           5.0         3.4          1.5         0.2  setosa
12          4.8         3.4          1.6         0.2  setosa
18          5.1         3.5          1.4         0.3  setosa

【讨论】：