【Question title】: Return second largest value in a group of columns
【Posted】: 2017-02-07 01:53:11
【Question】:

I have the following data frame:

dff <- structure(list(`MCI ID` = c("070405344", "230349820", "260386435","370390587", "380406805", "391169282", "440377986", "750391394","890373764", "910367024"), `123a_1` = structure(c(16672, 16372,16730, 16688, 16700, 16783, 16709, 17033, 16786, 16675), class = "Date"),`123a_2` = structure(c(17029, 16422, 17088, 17036, 17057,17140, 17072, 17043, 17141, 17038), class = "Date"), `123a_3` = structure(c(NA_real_,NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,NA_real_, NA_real_, NA_real_), class = "Date"), `123a_4` = structure(c(NA_real_,NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,NA_real_, NA_real_, NA_real_), class = "Date"), `123a_5` = structure(c(NA_real_,NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,NA_real_, NA_real_, NA_real_), class = "Date"), max123a = structure(c(17029,16422, 17088, 17036, 17057, 17140, 17072, 17043, 17141, 17038), class = "Date")), .Names = c("MCI ID", "123a_1", "123a_2","123a_3", "123a_4", "123a_5", "max123a"), row.nam... <truncated>

I have already created a column holding the row-wise maximum of 123a_1 through 123a_5. To do that, I used:

dff <- mutate(dff, max123a = pmax(`123a_1`, `123a_2`, `123a_3`, `123a_4`, `123a_5`, na.rm = T))

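However, I now need the second-largest value in each row. The solution should allow for 123a_3 through 123a_5 containing data other than NA. Ideally I would like a dplyr solution so that I can chain the two commands together, but I will accept anything.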

【Comments】:

  • apply(dff, 2, function(x) max(x[x != max(x)]))
  • @d.b, this works with a slight modification, apply(dff, 1, function(x) max(x[x != max(x)])), but it only returns NAs. Is there a way to pass the na.rm = T argument?
  • Your dput is truncated; please try again.
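
The NAs mentioned in the comments arise because max(x) returns NA as soon as x contains a missing value, and comparing against NA then yields NA for every element. A minimal sketch of where na.rm = TRUE has to go, using a made-up vector x (hypothetical illustration data, not a row of dff):

```r
# Hypothetical row: two dates followed by three missing values
x <- as.Date(c("2015-08-25", "2016-08-16", NA, NA, NA))

# Without na.rm, max() propagates the missing value
max(x)

# na.rm = TRUE is needed twice: once to find the maximum,
# and once to take the maximum of what remains after dropping it
second <- max(x[x != max(x, na.rm = TRUE)], na.rm = TRUE)
second
```

Here na.rm is passed to each max() call inside the anonymous function rather than to apply() itself.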

Tags: r dplyr


【Solution 1】:

Using dplyr and tidyr:

library(dplyr)
library(tidyr)
dff %>% 
  gather(var, val, 2:6) %>% 
  group_by(`MCI ID`) %>% 
  summarise(max2 = max(val[val != max(val, na.rm = TRUE)], na.rm = TRUE)) %>% 
  left_join(dff, .)

This gives:

      MCI ID     123a_1     123a_2 123a_3 123a_4 123a_5    max123a       max2
1  070405344 2015-08-25 2016-08-16   <NA>   <NA>   <NA> 2016-08-16 2015-08-25
2  230349820 2014-10-29 2014-12-18   <NA>   <NA>   <NA> 2014-12-18 2014-10-29
3  260386435 2015-10-22 2016-10-14   <NA>   <NA>   <NA> 2016-10-14 2015-10-22
4  370390587 2015-09-10 2016-08-23   <NA>   <NA>   <NA> 2016-08-23 2015-09-10
5  380406805 2015-09-22 2016-09-13   <NA>   <NA>   <NA> 2016-09-13 2015-09-22
6  391169282 2015-12-14 2016-12-05   <NA>   <NA>   <NA> 2016-12-05 2015-12-14
7  440377986 2015-10-01 2016-09-28   <NA>   <NA>   <NA> 2016-09-28 2015-10-01
8  750391394 2016-08-20 2016-08-30   <NA>   <NA>   <NA> 2016-08-30 2016-08-20
9  890373764 2015-12-17 2016-12-06   <NA>   <NA>   <NA> 2016-12-06 2015-12-17
10 910367024 2015-08-28 2016-08-25   <NA>   <NA>   <NA> 2016-08-25 2015-08-28
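
gather() has since been superseded in tidyr by pivot_longer(). The following is a sketch of the same reshape-then-summarise idea, assuming tidyr 1.0.0 or later. The data frame toy and its columns id, d1, d2, d3 are hypothetical stand-ins, because the original dput is truncated:

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for dff
toy <- data.frame(
  id = c("a", "b"),
  d1 = as.Date(c("2015-08-25", "2014-10-29")),
  d2 = as.Date(c("2016-08-16", "2014-12-18")),
  d3 = as.Date(c(NA, NA))
)

# Reshape to long form, then take the largest value that is not the maximum
second <- toy %>%
  pivot_longer(cols = d1:d3, names_to = "var", values_to = "val") %>%
  group_by(id) %>%
  summarise(max2 = max(val[val != max(val, na.rm = TRUE)], na.rm = TRUE))

# Attach the new column to the original wide data
result <- left_join(toy, second, by = "id")
result
```

Naming the join key with by = "id" avoids the message that left_join() prints when it has to guess the key.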

You can do everything in one chain as follows:

dff %>% 
  gather(var, val, 2:6) %>% 
  group_by(`MCI ID`) %>% 
  summarise(max2 = max(val[val != max(val, na.rm = TRUE)], na.rm = TRUE)) %>% 
  left_join(dff,.) %>% 
  mutate(max123a = pmax(`123a_1`, `123a_2`, `123a_3`, `123a_4`, `123a_5`, na.rm = TRUE))

A solution in base R:

dff$max2 <- apply(dff[2:6], 1, function(x) rev(sort(x))[2])
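
Two details of the apply() approach are worth checking on small data. apply() first converts the data frame to a matrix, so Date columns become character strings, and the sort-based version treats a tied maximum as the second-largest value. A sketch on a hypothetical data frame toy (not the original dff):

```r
# Hypothetical data: row 2 has its maximum twice
toy <- data.frame(
  d1 = as.Date(c("2015-08-25", "2016-08-30")),
  d2 = as.Date(c("2016-08-16", "2016-08-30")),
  d3 = as.Date(c(NA, "2016-08-20"))
)

# Sort-based version: sort() drops NAs, rev() puts the largest first
as_char <- apply(toy, 1, function(x) rev(sort(x))[2])
class(as_char)            # character, because apply() works on a matrix

# Convert back to restore the Date class
as_date <- as.Date(as_char)

# Filter-based version: removes every copy of the maximum first
distinct <- apply(toy, 1, function(x) {
  max(x[x != max(x, na.rm = TRUE)], na.rm = TRUE)
})
```

For row 2 the sort-based version returns the tied maximum itself, while the filter-based version returns the next distinct date. Which one is wanted depends on how ties should count.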

【Comments】:

    【Solution 2】:

    We can use tidyverse:

    library(tidyverse)
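    # Note: summarise_each() works column by column, so this returns the
    # second-largest value of each column rather than of each row
    dff %>%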
          summarise_each(funs(rev(sort(.))[2]))
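
For a row-wise result that can be chained with the maximum, as the question asks, one option is rowwise() with c_across(). This is a sketch assuming dplyr 1.0.0 or later. The data frame toy and its columns are hypothetical stand-ins for dff:

```r
library(dplyr)

# Hypothetical stand-in for dff (the original dput is truncated)
toy <- data.frame(
  id = c("a", "b"),
  d1 = as.Date(c("2015-08-25", "2014-10-29")),
  d2 = as.Date(c("2016-08-16", "2014-12-18")),
  d3 = as.Date(c(NA, NA))
)

result <- toy %>%
  rowwise() %>%
  mutate(
    # Largest value in the row, ignoring NAs
    max1 = max(c_across(d1:d3), na.rm = TRUE),
    # sort() drops NAs by default; element 2 is the second largest
    max2 = sort(c_across(d1:d3), decreasing = TRUE)[2]
  ) %>%
  ungroup()
result
```

If a row has fewer than two non-missing values, indexing with [2] yields NA for that row.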
    

    【Comments】:
