Filter rows in R based on values in multiple rows: answers

【Question title】: Filter rows in R based on values in multiple rows
【Posted】: 2018-01-07 00:44:15
【Question description】:

I'm trying to filter out multiple rows of unwanted data in R, but I can't work out how to do it.

The data I'm using looks a bit like this:

  Category     Item Shop1 Shop2 Shop3
1    Fruit   Apples     4     6     0
2    Fruit  Oranges     0     2     7
3      Veg Potatoes     0     0     0
4      Veg   Onions     0     0     0
5      Veg  Carrots     0     0     0
6    Dairy  Yoghurt     0     0     0
7    Dairy     Milk     0     1     0
8    Dairy   Cheese     0     0     0

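I only want to keep the categories in which at least one item has a positive value for at least one shop.

In this case, I want to remove all the Veg rows, because none of the shops sell any vegetables. I want to keep all the Fruit rows, and I want to keep all the Dairy rows, even those with zeros in every shop, because one of the Dairy rows does have a value greater than 0.

I tried using colSums after group_by(Category), hoping it would sum the contents of each Category in turn, but it didn't work. I also tried adding a rowSums column at the end and filtering on it, but that way I can only filter out individual rows, not rows based on the whole Category.

Filtering out individual all-zero rows (such as row 3) is easy. My difficulty is keeping rows like 6 and 8, where every shop has a value of zero, because other Dairy rows do have values greater than zero.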

【Question discussion】:

Tags: r dataframe dplyr

【Solution 1】:

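1) subset/ave. rowSums(...) > 0 has one element per row, and that element is TRUE if the row contains a nonzero value. This assumes that negative values are not possible. (If they are, use rowSums(DF[-1:-2]^2) > 0 instead.) It also assumes that the shops are the columns after the first two, so it works for any number of shops. ave then produces TRUE for every row of each group in which any of those values is TRUE, and subset keeps only those rows. No packages are used.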

subset(DF, ave(rowSums(DF[-1:-2]) > 0, Category, FUN = any))

giving:

  Category    Item Shop1 Shop2 Shop3
1    Fruit  Apples     4     6     0
2    Fruit Oranges     0     2     7
6    Dairy Yoghurt     0     0     0
7    Dairy    Milk     0     1     0
8    Dairy  Cheese     0     0     0
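
To see why the one-liner works, it can help to unpack it one step at a time. The following is a minimal sketch only: the intermediate names row_has_sale, group_has_sale, result and neg are introduced here for illustration and are not part of the answer, and DF is rebuilt inline so the block runs on its own.

```r
# Rebuild the example data so the block is self-contained.
DF <- data.frame(
  Category = c("Fruit", "Fruit", "Veg", "Veg", "Veg", "Dairy", "Dairy", "Dairy"),
  Item     = c("Apples", "Oranges", "Potatoes", "Onions", "Carrots",
               "Yoghurt", "Milk", "Cheese"),
  Shop1    = c(4L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
  Shop2    = c(6L, 2L, 0L, 0L, 0L, 0L, 1L, 0L),
  Shop3    = c(0L, 7L, 0L, 0L, 0L, 0L, 0L, 0L),
  stringsAsFactors = FALSE
)

# Step 1 (illustrative name): one logical per row, TRUE if the shop columns
# of that row sum to more than zero. Assumes no negative values.
row_has_sale <- rowSums(DF[-(1:2)]) > 0

# Step 2 (illustrative name): within each Category, replace every element by
# any() of its group, so all rows of a category share the same TRUE/FALSE.
group_has_sale <- ave(row_has_sale, DF$Category, FUN = any)

# Step 3: keep the rows whose category has at least one sale.
result <- DF[group_has_sale, ]

# Why squaring matters if negatives can occur: nonzero entries may cancel.
neg <- data.frame(Shop1 = -1, Shop2 = 1, Shop3 = 0)
rowSums(neg) > 0     # FALSE: -1 and 1 cancel out
rowSums(neg^2) > 0   # TRUE: squares cannot cancel
```

Because ave returns a vector as long as its input, the grouped result lines up with the rows of DF and can index it directly.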

1a) If you don't mind hard-coding the shops, this variation can be used:

subset(DF, ave(Shop1 + Shop2 + Shop3 > 0, Category, FUN = any))

2) dplyr

library(dplyr)
DF %>% group_by(Category) %>% filter(any(Shop1, Shop2, Shop3)) %>% ungroup

giving:

# A tibble: 5 x 5
  Category    Item Shop1 Shop2 Shop3
    <fctr>  <fctr> <int> <int> <int>
1    Fruit  Apples     4     6     0
2    Fruit Oranges     0     2     7
3    Dairy Yoghurt     0     0     0
4    Dairy    Milk     0     1     0
5    Dairy  Cheese     0     0     0
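
The any(Shop1, Shop2, Shop3) call relies on implicit coercion of the counts to logical. That works for integer columns, but the documentation of any notes that coercing other types, such as double, gives a warning. A variant with explicit comparisons avoids the coercion and matches the question's wording ("positive value") more literally. This is a sketch of an alternative, not the answer's original code; DF is rebuilt inline and res is an illustrative name.

```r
library(dplyr)

# Rebuild the example data so the block is self-contained.
DF <- data.frame(
  Category = c("Fruit", "Fruit", "Veg", "Veg", "Veg", "Dairy", "Dairy", "Dairy"),
  Item     = c("Apples", "Oranges", "Potatoes", "Onions", "Carrots",
               "Yoghurt", "Milk", "Cheese"),
  Shop1    = c(4L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
  Shop2    = c(6L, 2L, 0L, 0L, 0L, 0L, 1L, 0L),
  Shop3    = c(0L, 7L, 0L, 0L, 0L, 0L, 0L, 0L),
  stringsAsFactors = FALSE
)

res <- DF %>%
  group_by(Category) %>%
  # Keep the whole group if any row has a positive value in any shop.
  filter(any(Shop1 > 0 | Shop2 > 0 | Shop3 > 0)) %>%
  ungroup()

res
```

The grouped filter evaluates the condition once per Category; a single TRUE is recycled across the group, so either all rows of a category are kept or none are.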

3) Filter/split. Another base solution is:

do.call("rbind", Filter(function(x) any(x[-1:-2]), split(DF, DF$Category)))

giving:

        Category    Item Shop1 Shop2 Shop3
Dairy.6    Dairy Yoghurt     0     0     0
Dairy.7    Dairy    Milk     0     1     0
Dairy.8    Dairy  Cheese     0     0     0
Fruit.1    Fruit  Apples     4     6     0
Fruit.2    Fruit Oranges     0     2     7
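
Note that rbind over the split pieces reorders the rows by category and rewrites the row names. If the original order and row names matter, one option is to use split and Filter only to decide which categories survive, and then index the original data frame. The sketch below assumes the same DF layout (shops after the first two columns); groups, keep and result are illustrative names.

```r
# Rebuild the example data so the block is self-contained.
DF <- data.frame(
  Category = c("Fruit", "Fruit", "Veg", "Veg", "Veg", "Dairy", "Dairy", "Dairy"),
  Item     = c("Apples", "Oranges", "Potatoes", "Onions", "Carrots",
               "Yoghurt", "Milk", "Cheese"),
  Shop1    = c(4L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
  Shop2    = c(6L, 2L, 0L, 0L, 0L, 0L, 1L, 0L),
  Shop3    = c(0L, 7L, 0L, 0L, 0L, 0L, 0L, 0L),
  stringsAsFactors = FALSE
)

# One data frame per category.
groups <- split(DF, DF$Category)

# Keep only the categories with at least one positive shop value.
keep <- Filter(function(g) any(g[-(1:2)] > 0), groups)

# Index the original data frame, preserving row order and row names.
result <- DF[DF$Category %in% names(keep), ]
result
```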

4) dplyr/tidyr. Use gather to convert the data to long form, with one row per value, then filter the groups using any. Finally, convert back to wide form.

library(dplyr)
library(tidyr)
DF %>% 
  gather(shop, value, -(Category:Item)) %>% 
  group_by(Category) %>% 
  filter(any(value)) %>% 
  ungroup %>% 
  spread(shop, value)

giving:

# A tibble: 5 x 5
  Category    Item Shop1 Shop2 Shop3
*   <fctr>  <fctr> <int> <int> <int>
1    Dairy  Cheese     0     0     0
2    Dairy    Milk     0     1     0
3    Dairy Yoghurt     0     0     0
4    Fruit  Apples     4     6     0
5    Fruit Oranges     0     2     7

Note: The input in reproducible form is:

Lines <- "  Category     Item Shop1 Shop2 Shop3
1    Fruit   Apples     4     6     0
2    Fruit  Oranges     0     2     7
3      Veg Potatoes     0     0     0
4      Veg   Onions     0     0     0
5      Veg  Carrots     0     0     0
6    Dairy  Yoghurt     0     0     0
7    Dairy     Milk     0     1     0
8    Dairy   Cheese     0     0     0"

DF <- read.table(text = Lines)
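
The printed outputs above show Category and Item as factors, which was the read.table default when this was posted. Since R 4.0.0 the default is stringsAsFactors = FALSE, so the same call yields character columns. The solutions work either way, but to reproduce the factor columns one could pass the argument explicitly, as in this sketch:

```r
Lines <- "  Category     Item Shop1 Shop2 Shop3
1    Fruit   Apples     4     6     0
2    Fruit  Oranges     0     2     7
3      Veg Potatoes     0     0     0
4      Veg   Onions     0     0     0
5      Veg  Carrots     0     0     0
6    Dairy  Yoghurt     0     0     0
7    Dairy     Milk     0     1     0
8    Dairy   Cheese     0     0     0"

# The header is detected automatically because the first line has one field
# fewer than the data lines; the extra leading field becomes the row names.
DF <- read.table(text = Lines, stringsAsFactors = TRUE)
str(DF)
```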

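【Discussion】:

That's neat: giving ave a logical vector as its first argument means the final output can be used directly for subsetting.
Wow, thank you for the multiple solutions and the clear explanations!

【Solution 2】:

Here is a base R method using rowSums, ave, and [.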

dat[ave(rowSums(dat[grep("Shop", names(dat))]), dat$Category, FUN=max) > 0,]

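rowSums calculates the total sales in each row across the shop variables (which are selected using grep). The resulting vector is fed to ave, which groups it by dat$Category and returns the maximum row total within each category, repeated for every row of that category. Finally, the original data.frame is subset according to whether that maximum is positive.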

  Category    Item Shop1 Shop2 Shop3
1    Fruit  Apples     4     6     0
2    Fruit Oranges     0     2     7
6    Dairy Yoghurt     0     0     0
7    Dairy    Milk     0     1     0
8    Dairy  Cheese     0     0     0
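
The intermediate values make the logic easier to follow. Below is a minimal sketch that splits the one-liner into steps; shop_cols, row_sales, cat_max and result are illustrative names that do not appear in the answer, and dat is rebuilt inline in a shorter form than the structure() call.

```r
# Rebuild the example data so the block is self-contained.
dat <- data.frame(
  Category = c("Fruit", "Fruit", "Veg", "Veg", "Veg", "Dairy", "Dairy", "Dairy"),
  Item     = c("Apples", "Oranges", "Potatoes", "Onions", "Carrots",
               "Yoghurt", "Milk", "Cheese"),
  Shop1    = c(4L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
  Shop2    = c(6L, 2L, 0L, 0L, 0L, 0L, 1L, 0L),
  Shop3    = c(0L, 7L, 0L, 0L, 0L, 0L, 0L, 0L),
  stringsAsFactors = FALSE
)

# Positions of the columns whose names contain "Shop": 3 4 5
shop_cols <- grep("Shop", names(dat))

# Total per row across the shops: 10 9 0 0 0 0 1 0
row_sales <- rowSums(dat[shop_cols])

# Maximum row total within each category, repeated per row: 10 10 0 0 0 1 1 1
cat_max <- ave(row_sales, dat$Category, FUN = max)

# Keep rows whose category maximum is positive.
result <- dat[cat_max > 0, ]
result
```

Like the rowSums approach in Solution 1, this assumes values are never negative, since a row with offsetting positive and negative entries would sum to zero.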

Data

dat <-
structure(list(Category = structure(c(2L, 2L, 3L, 3L, 3L, 1L, 
1L, 1L), .Label = c("Dairy", "Fruit", "Veg"), class = "factor"), 
    Item = structure(c(1L, 6L, 7L, 5L, 2L, 8L, 4L, 3L), .Label = c("Apples", 
    "Carrots", "Cheese", "Milk", "Onions", "Oranges", "Potatoes", 
    "Yoghurt"), class = "factor"), Shop1 = c(4L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L), Shop2 = c(6L, 2L, 0L, 0L, 0L, 0L, 1L, 0L
    ), Shop3 = c(0L, 7L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("Category", 
"Item", "Shop1", "Shop2", "Shop3"), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8"))

【Discussion】:

Nice. I was about to post df[!!ave(rowSums(df[3:5]), df$Category, FUN = function(i) sum(i) > 0),]