dplyr::select - using a column more than once? Answers

【Question title】: dplyr::select - using column more than once?
【Posted】: 2018-05-29 14:27:26
【Question】:

select(mtcars,foo=mpg,bar=mpg)

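This returns a data frame with only one column, bar. It seems dplyr discards the earlier occurrence of the column, which makes multiple aliases for the same column impossible. Bug? By design? Workaround?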

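【Comments】:

I'm pretty sure this is by design. You can use mutate instead.
I'm really surprised it doesn't fail, and it is inconsistent with how mutate behaves: select(mtcars,foo=mpg,bar=foo) fails, and I think it should be the other way around.
Actually, transmute(mtcars,foo=mpg,bar=foo) and transmute(mtcars,foo=mpg,bar=mpg) both work.

Tags: r dplyr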

【Solution 1】:

Workaround: add a mutate that creates bar from foo.

mtcars %>% 
  select(foo = mpg) %>% 
  mutate(bar = foo)

【Comments】:

【Solution 2】:

You can use transmute(mtcars, foo = mpg, bar = mpg) (note that this drops the row names).
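If the row names matter, one way to sidestep the caveat is to copy them into an ordinary column before calling transmute. This is only a sketch: it assumes the tibble package is installed, and the column name "car" is an arbitrary choice made for illustration.

```r
library(dplyr)
library(tibble)

res <- mtcars %>%
  rownames_to_column("car") %>%          # "car" is an illustrative name
  transmute(car, foo = mpg, bar = mpg)   # keep the names, alias mpg twice

head(res, 3)
```

The row names then travel as data, so they survive regardless of how a given dplyr version treats the row.names attribute.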

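【Comments】:

【Solution 3】:

I don't understand why everyone is using dplyr to solve this. Base R is much faster:

Update: I wrote myfun6 and myfun5 in base R. The former is scalable; the latter is not. The other four functions (myfun1 to myfun4) are the dplyr solutions. The benchmark shows dplyr is more than ten times slower: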

microbenchmark::microbenchmark(myfun1(),myfun2(),myfun3(),myfun4(),myfun5(),myfun6())
Unit: microseconds
     expr    min      lq      mean  median       uq     max neval
 myfun1() 5356.6 5739.90  6320.338 5967.45  6327.75 11177.7   100
 myfun2() 6208.1 6676.55  7220.770 6941.10  7172.55 10936.3   100
 myfun3() 8645.3 9299.30 10287.908 9676.30 10312.85 15837.1   100
 myfun4() 4426.1 4712.40  5405.235 4866.65  5245.20 12573.2   100
 myfun5()  168.6  250.05   292.472  270.70   303.15  2119.3   100
 myfun6()  141.7  203.15   341.079  237.00   256.45  6278.0   100

Code:

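library(dplyr)  # select, mutate, transmute, bind_cols
library(purrr)  # map2

# Base R, scalable: build n copies of the column, then name them
myfun6<-function(){
n=2
res_l<-lapply(1:n,function(j) mtcars$mpg)
res<-data.frame(do.call(cbind,res_l))
rownames(res)=rownames(mtcars)
colnames(res)=c('foo','bar')
res  # return the data frame, not the value of the last assignment
}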

myfun5<-function(){
res<-data.frame(foo=mtcars$mpg,bar=mtcars$mpg)  
}

myfun4<-function(){
  mtcars %>% 
  select(foo=mpg) %>% 
  bind_cols(bar=.$foo)
}

myfun3<-function(){
res<-map2(c('mpg', 'mpg'), c('foo', 'bar'), ~ mtcars %>% 
          select(!! .y := !! rlang::sym(.x))) %>% 
  bind_cols
}

myfun2<-function(){
  res<-transmute(mtcars, foo = mpg, bar = mpg)
}

myfun1<-function(){
  res<-mtcars %>% 
  select(foo = mpg) %>% 
  mutate(bar = foo)
}
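Before comparing timings, it can help to confirm that the approaches actually produce the same data. The sketch below compares one base R variant with one dplyr variant; it checks column values only, because the variants may treat row names differently.

```r
library(dplyr)

base_res  <- data.frame(foo = mtcars$mpg, bar = mtcars$mpg)
dplyr_res <- transmute(mtcars, foo = mpg, bar = mpg)

# Compare names and column values; row names are deliberately ignored
same_names  <- identical(names(base_res), names(dplyr_res))
same_values <- identical(base_res$foo, dplyr_res$foo) &&
  identical(base_res$bar, dplyr_res$bar)

same_names && same_values
```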

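【Comments】:

While base R is faster, dplyr usually leads to cleaner ("tidy") code and better readability. If you had to maintain someone else's code, would you rather work with myfun6 or myfun2? I read somewhere that machines will always get faster, but unreadable code stays unreadable.
The question is not about efficiency but about whether dplyr::select works as intended. So, IMO, the appropriate workaround is one that uses dplyr functions.
You're right @user. @ArtemSokolov my point is: if readability is the goal, wouldn't you want to force users to deliver their code in a more readable way, e.g. Python?
I don't think microseconds matter much here, but I played along anyway, and this is faster: setNames(mtcars[,c("mpg","mpg")],c("foo","bar"))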
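The setNames idea from the last comment generalizes to any number of aliases. A minimal sketch follows, assuming a hypothetical helper named alias_columns (it is not part of base R or dplyr) that takes a named character vector mapping new names to existing columns.

```r
# Hypothetical helper, written for illustration only.
# `aliases`: named character vector, names = new names, values = source columns
alias_columns <- function(df, aliases) {
  stopifnot(is.character(aliases),
            !is.null(names(aliases)),
            all(aliases %in% names(df)))
  out <- df[unname(aliases)]      # the same column may be picked repeatedly
  setNames(out, names(aliases))   # apply the requested aliases
}

res <- alias_columns(mtcars, c(foo = "mpg", bar = "mpg", cylinders = "cyl"))
head(res, 3)
```

Because this uses plain data frame indexing, the row names of the input are kept.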

【Solution 4】:

You can also do

mtcars %>% 
  select(foo=mpg) %>% 
  bind_cols(bar=.$foo)

or

mtcars %>% 
  bind_cols(foo=.$mpg, bar=.$mpg) %>% 
  select(foo, bar)

【Comments】:

【Solution 5】:

We can use

library(tidyverse)
library(rlang)
map2(c('mpg', 'mpg'), c('foo', 'bar'), ~ mtcars %>% 
          select(!! .y := !! rlang::sym(.x))) %>% 
  bind_cols

Another option is to replicate the selected column and set the names to the desired ones

replicate(2, mtcars %>%
                   select(mpg))  %>%
      set_names(c('foo', 'bar')) %>%
      bind_cols
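A somewhat simpler variant of the same idea drives everything from a single named vector, so no tidy evaluation is needed. This is a sketch rather than the answerer's method; the aliases vector is an illustrative assumption.

```r
library(dplyr)
library(purrr)

# Illustrative mapping: names are the new aliases, values the source columns
aliases <- c(foo = "mpg", bar = "mpg")

res <- aliases %>%
  map(~ mtcars[[.x]]) %>%   # named list with one vector per alias
  bind_cols()               # combine the vectors into one data frame

head(res, 3)
```

The extracted vectors carry no row names, so the result does not have the car names either.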

【Comments】: