Return original data set after top_n: answers

【Question title】: Return original data set after top_n
【Posted】: 2019-12-05 09:22:40
【Question description】:

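Given a data set, we can use top_n to limit the number of rows returned in the tidyverse (i.e. by ordering/ranking). I like the flexibility of most tidyverse operations, since in most cases they can be undone, i.e. you can get back to where you started.

Using the data and a possible solution (which I wrote) from the question here, how can I best undo top_n?

Data: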

df<-structure(list(milk = c(1L, 2L, 1L, 0L, 4L), bread = c(4L, 5L, 
2L, 1L, 10L), juice = c(3L, 4L, 6L, 5L, 2L), honey = c(1L, 2L, 
0L, 3L, 1L), eggs = c(4L, 4L, 7L, 3L, 5L), beef = c(2L, 3L, 0L, 
1L, 8L)), class = "data.frame", row.names = c(NA, -5L))

Code:

df %>% 
  gather(key,value) %>% 
  group_by(key) %>% 
  summarise(Sum=sum(value)) %>% 
  arrange(desc(Sum)) %>% 
  top_n(3,Sum) %>% 
  ungroup()

The above gives me this:

# A tibble: 3 x 2
  key     Sum
  <chr> <int>
1 eggs     23
2 bread    22
3 juice    20

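What I would now like to (learn how to) do is get back to the original data set without deleting the code, i.e. recover from top_n programmatically.

My first thought was spreading (res is the result above):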

 spread(res,key,Sum)
# A tibble: 1 x 3
  bread  eggs juice
  <int> <int> <int>
1    22    23    20

However, I cannot (yet) think of how to proceed from there, or of an alternative way to undo top_n. How can I best do this?

【Question comments】:

Do you mean how to get from the per-group sums back to all the original individual values in each group?
Isn't top_n a filter, which you cannot undo?
You could first do cols <- df %>% gather(key, value) %>% group_by(key) %>% summarise(Sum = sum(value)) %>% arrange(desc(Sum)) %>% top_n(3, Sum) %>% ungroup() %>% pull(key) and then df %>% select(one_of(cols)).
In theory you could also do df %>% select(one_of(df %>% gather(key, value) %>% group_by(key) %>% summarise(Sum = sum(value)) %>% arrange(desc(Sum)) %>% top_n(3, Sum) %>% ungroup() %>% pull(key))).
Or possibly even df %>% gather(key, value) %>% group_by(key) %>% summarise(Sum = sum(value), Values = list(value), Row = list(row_number())) %>% arrange(desc(Sum)) %>% top_n(3, Sum) %>% select(-Sum) %>% ungroup() %>% unnest() %>% spread(key, Values).

Tags: r dplyr tidyr

【Solution 1】:

A similar idea using pull, but with a slightly different approach:

library(tidyverse)

df %>%
  summarise_all(sum) %>%  # Your method of selecting 
  gather(key, val) %>%    # top three columns 
  top_n(3) %>%            # 
  arrange(-val) %>%       #
  pull(key) %>%           # pull 'key'
  select(df, .)           # select cols from df by `.`

#  eggs bread juice
#1    4     4     3
#2    4     5     4
#3    7     2     6
#4    3     1     5
#5    5    10     2

And, developing the idea from the previous question:

df[, '['(names(sort(colSums(df), T)), 1:3)]

The result is the same.
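
The one-liner above can be unpacked into named steps. The following is only a sketch in base R; `col_totals`, `top3` and `top_df` are illustrative names that do not appear in the answer, and `df` is rebuilt from the data in the question so the block runs on its own.

```r
# Rebuild the data frame from the question
df <- data.frame(
  milk  = c(1L, 2L, 1L, 0L, 4L),
  bread = c(4L, 5L, 2L, 1L, 10L),
  juice = c(3L, 4L, 6L, 5L, 2L),
  honey = c(1L, 2L, 0L, 3L, 1L),
  eggs  = c(4L, 4L, 7L, 3L, 5L),
  beef  = c(2L, 3L, 0L, 1L, 8L)
)

# One total per column, named by column
col_totals <- colSums(df)

# Names of the three columns with the largest totals, largest first
top3 <- names(sort(col_totals, decreasing = TRUE))[1:3]

# Index the untouched original data by those names
top_df <- df[, top3]
top_df
```

Because the original `df` is never modified, nothing has to be "undone": the summary is used only to decide which columns to keep.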

【Comments】:

【Solution 2】:

Here is a very dense base R solution:

# order() gives the column positions sorted by decreasing column sum
df[, order(-colSums(df))[1:3]]
  eggs bread juice
1    4     4     3
2    4     5     4
3    7     2     6
4    3     1     5
5    5    10     2
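
Indexing columns this way needs the column positions sorted by decreasing sum, which is what order() returns. rank() returns each element's rank in the original element order, and the two agree only for particular inputs (the column sums of `df` happen to be one such case). A small check, using hypothetical sums that are not from the question:

```r
# Hypothetical column sums, chosen so that order() and rank() disagree
sums <- c(a = 10, b = 40, c = 20, d = 30)

# Positions of the elements, largest sum first
top_by_order <- order(-sums)[1:3]

# Ranks of the first three elements (a, b, c), not positions
first_ranks <- rank(-sums)[1:3]

names(sums)[top_by_order]  # the three largest: b, d, c
names(sums)[first_ranks]   # d, a, c: includes a, the smallest
```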

【Comments】:

【Solution 3】:

This is not necessarily the reverse process, but one possibility is to select based on the column names:

df %>% 
  gather(Key, Value) %>% 
  group_by(Key) %>%
  summarise(Sum = sum(Value)) %>% 
  arrange(desc(Sum)) %>%
  top_n(3, Sum) %>%
  ungroup() %>%
  pull(Key) %>% 
  {select(df, one_of(.))}

  eggs bread juice
1    4     4     3
2    4     5     4
3    7     2     6
4    3     1     5
5    5    10     2

Another possibility is to put the values and row numbers into list columns, then unnest and spread:

df %>% 
 gather(Key, Value) %>% 
 group_by(Key) %>%
 summarise(Sum = sum(Value),
           Values = list(Value),
           Row_ID = list(row_number())) %>% 
 arrange(desc(Sum)) %>% 
 top_n(3, Sum) %>%
 select(-Sum) %>%
 ungroup() %>%
 unnest() %>%
 spread(Key, Values) %>%
 select(-Row_ID)

  bread  eggs juice
  <int> <int> <int>
1     4     4     3
2     5     4     4
3     2     7     6
4     1     3     5
5    10     5     2
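
In later releases of dplyr and tidyr, top_n(), gather() and one_of() are superseded by slice_max(), pivot_longer() and all_of()/any_of(). The following sketch restates the column-name approach with those verbs, assuming dplyr 1.0.0 or later and tidyr 1.0.0 or later are installed; `top_cols` and `result` are illustrative names.

```r
library(dplyr)
library(tidyr)

# Rebuild the data frame from the question
df <- data.frame(
  milk  = c(1L, 2L, 1L, 0L, 4L),
  bread = c(4L, 5L, 2L, 1L, 10L),
  juice = c(3L, 4L, 6L, 5L, 2L),
  honey = c(1L, 2L, 0L, 3L, 1L),
  eggs  = c(4L, 4L, 7L, 3L, 5L),
  beef  = c(2L, 3L, 0L, 1L, 8L)
)

# Names of the three columns with the largest sums, largest first
top_cols <- df %>%
  pivot_longer(everything(), names_to = "Key", values_to = "Value") %>%
  group_by(Key) %>%
  summarise(Sum = sum(Value)) %>%
  slice_max(Sum, n = 3) %>%
  pull(Key)

# Select those columns from the original, unmodified data
result <- select(df, all_of(top_cols))
result
```

all_of() raises an error if a name is missing from `df`, whereas any_of() would silently skip it; which is preferable depends on how strict the selection should be.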

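【Comments】:

Thanks, I will accept an answer in a few hours or a day. I am just keeping this open in case others have alternatives.
@tmfmnk I think the one_of() answer is the best so far. +1. I edited it a little to improve readability. If you disagree with my changes, feel free to roll them back.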