【Question title】:How can I select columns based on two conditions? [duplicate]
【Posted】:2019-08-14 21:44:58
【Question】:

I have a data frame with many columns. For example:

sample treatment col5 col6 col7
  1        a       3    0   5  
  2        a       1    0   3
  3        a       0    0   2
  4        b       0    1   1

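I want to select the sample and treatment columns, plus every column that satisfies both of the following conditions:

  1. the value in the row where treatment == 'b' is 0
  2. at least one row where treatment == 'a' has a value that is not 0

The expected result should look like this: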

sample treatment col5
  1        a       3      
  2        a       1      
  3        a       0      
  4        b       0       

Example data frame (dput output):

structure(list(sample = 1:4, treatment = structure(c(1L, 1L, 
1L, 2L), .Label = c("a", "b"), class = "factor"), col5 = c(3, 
1, 0, 0), col6 = c(0, 0, 0, 1), col7 = c(5, 3, 2, 1)), class = "data.frame", row.names = c(NA, 
-4L))

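【Comments】:

  • Possible duplicate of Filter data.frame rows by a logical condition; to add extra conditions to the duplicate question, just put & between the conditions, writing it as data[cond1 & cond2, ]
  • Sorry, I'm new to this site. What does your comment mean? I checked the question you mentioned, but that one selects rows, not columns.
  • Sorry, I was in a hurry. As Shree's answer shows, selecting columns can be done by switching from data[cond1 & cond2, ] to data[, cond1 & cond2] (note the position of the comma). Basically, you index data.frames and matrices as data[row, column], where row is the condition or the row numbers you want to extract, and likewise for the column argument. Have a look at Hadley's online book Advanced R. Despite its name, a first look at the essential chapters, such as the one on subsetting, will take you a long way without having to read the advanced material.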
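
A minimal sketch of the row-versus-column indexing described in the last comment, using the example data from the question. The names `keep_rows` and `keep_cols` are illustrative only and do not appear in the original thread.

```r
# Example data from the question
df <- data.frame(
  sample    = 1:4,
  treatment = factor(c("a", "a", "a", "b")),
  col5 = c(3, 1, 0, 0),
  col6 = c(0, 0, 0, 1),
  col7 = c(5, 3, 2, 1)
)

# A logical vector with one entry per ROW goes before the comma
keep_rows <- df$treatment == "a" & df$col5 > 0
rows_subset <- df[keep_rows, ]

# A logical vector with one entry per COLUMN goes after the comma
keep_cols <- names(df) %in% c("sample", "treatment", "col5")
cols_subset <- df[, keep_cols]

print(rows_subset)
print(cols_subset)
```

The only difference between the two subsets is which side of the comma the logical vector sits on, and its length has to match the number of rows or columns accordingly.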

Tags: r dplyr subset


【Solution 1】:

Here is one way in base R:

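# TRUE for columns with at least one non-zero value in the treatment "a" rows
# (assumes non-negative values, so a positive sum implies a non-zero entry)
cs_a <- colSums(df[df$treatment == "a",-c(1:2)]) > 0
# TRUE for columns whose values in the treatment "b" rows sum to zero
cs_b <- colSums(df[df$treatment == "b",-c(1:2)]) == 0

# always keep sample and treatment, then the columns meeting both conditions
df[, c(TRUE, TRUE, cs_a & cs_b)]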

  sample treatment col5
1      1         a    3
2      2         a    1
3      3         a    0
4      4         b    0
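
The colSums comparison works for the example because every value is non-negative. As a hedged variant, assuming the value columns could contain negative numbers or that there could be several "b" rows, the two conditions can be tested element by element instead of through sums. The names `is_a`, `is_b`, `value_cols` and `meets_both` are illustrative and not part of the original answer.

```r
# Example data from the question
df <- data.frame(
  sample    = 1:4,
  treatment = factor(c("a", "a", "a", "b")),
  col5 = c(3, 1, 0, 0),
  col6 = c(0, 0, 0, 1),
  col7 = c(5, 3, 2, 1)
)

is_a <- df$treatment == "a"
is_b <- df$treatment == "b"

# Every column except the two identifier columns is a candidate
value_cols <- setdiff(names(df), c("sample", "treatment"))

# Condition 1: all values in the "b" rows are 0
# Condition 2: at least one value in the "a" rows is not 0
meets_both <- vapply(
  df[value_cols],
  function(x) all(x[is_b] == 0) && any(x[is_a] != 0),
  logical(1)
)

result <- df[, c("sample", "treatment", value_cols[meets_both]), drop = FALSE]
print(result)
```

Selecting by name rather than by position also means the code does not depend on sample and treatment being the first two columns.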

With dplyr (reusing cs_a and cs_b from above):

library(dplyr)

df %>% 
  select_at(which(c(TRUE, TRUE, cs_a & cs_b)))

【Comments】:

  • Is it possible to use dplyr::select?
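
One possible answer to the comment above, as a sketch rather than the answerer's own method: `dplyr::select()` accepts a predicate through `where()`, assuming dplyr 1.0.0 or later. The helper vectors `is_a` and `is_b` are illustrative names, not taken from the thread.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  sample    = 1:4,
  treatment = factor(c("a", "a", "a", "b")),
  col5 = c(3, 1, 0, 0),
  col6 = c(0, 0, 0, 1),
  col7 = c(5, 3, 2, 1)
)

is_a <- df$treatment == "a"
is_b <- df$treatment == "b"

result <- df %>%
  select(
    sample, treatment,
    # keep numeric columns that are 0 in every "b" row
    # and non-zero in at least one "a" row
    where(function(x) is.numeric(x) && all(x[is_b] == 0) && any(x[is_a] != 0))
  )

print(result)
```

The `is.numeric(x)` check comes first so that the factor column treatment is never compared against 0.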
【Solution 2】:

Here is a more verbose way in the tidyverse, which does not require calling colSums by hand for each treatment level:

library(dplyr)
library(purrr)
library(tidyr)

sample <- 1:4
treatment <- c("a", "a", "a", "b")
col5 <- c(3,1,0,0)
col6 <- c(0,0,0,1)
col7 <- c(5,3,2,1)

dd <- data.frame(sample, treatment, col5, col6, col7)
# first create new columns that report whether the entries are zero
dd2 <- mutate_if(
  .tbl = dd,
  .predicate = is.numeric,
  .funs = function(x)
    x == 0
)

# then find the sum per column and per treatment group
# in R TRUE = 1 and FALSE = 0
number_of_zeros <- dd2 %>% 
  group_by(treatment) %>% 
  summarise_at(.vars = vars(col5:col7), .funs = "sum")

# then find the names of the columns you want to keep
keeper_columns <-
  number_of_zeros %>% 
  select(-treatment) %>% # remove the treatment grouping variable
  map_dfr( # function to check if all entries per column (now per treatment level) are greater than zero
    .x = .,
    .f = function(x)
      all(x > 0)
  ) %>% 
  gather(column, keeper) %>% # reformat 
  filter(keeper == TRUE) %>% # to grab the keepers
  select(column) %>% # then select the column with column names
  unlist %>% # and convert to character vector
  unname

# subset the original dataset for the wanted columns
wanted_columns <- dd %>% select(1:2, all_of(keeper_columns))
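
One thing worth checking: the `all(x > 0)` test above keeps a column when every treatment group contains at least one zero. That yields col5 for the sample data, but it is not quite the same rule as the two conditions in the question. The sketch below compares both rules on the example data extended with a hypothetical column `col8` (all zeros), which is an assumption added purely for illustration.

```r
dd <- data.frame(
  sample    = 1:4,
  treatment = c("a", "a", "a", "b"),
  col5 = c(3, 1, 0, 0),
  col6 = c(0, 0, 0, 1),
  col7 = c(5, 3, 2, 1),
  col8 = c(0, 0, 0, 0)  # hypothetical column, not in the original data
)

value_cols <- c("col5", "col6", "col7", "col8")
is_a <- dd$treatment == "a"
is_b <- dd$treatment == "b"

# Rule used in the tidyverse code: at least one zero in each treatment group
zero_in_every_group <- vapply(
  dd[value_cols],
  function(x) all(tapply(x == 0, dd$treatment, sum) > 0),
  logical(1)
)

# Rule stated in the question: all "b" values are 0, some "a" value is not 0
question_rule <- vapply(
  dd[value_cols],
  function(x) all(x[is_b] == 0) && any(x[is_a] != 0),
  logical(1)
)

print(rbind(zero_in_every_group, question_rule))
```

The two rules agree on col5, col6 and col7 but disagree on col8, so the grouped zero count may need adjusting if such columns can occur in real data.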
