Select unique values with the 'select' function in the 'dplyr' library

【Question title】: Select unique values with 'select' function in 'dplyr' library
【Posted】: 2014-10-23 15:48:18
【Question description】:

Is it possible to select all unique values from a column of a data.frame using the select function from the dplyr library? Something like "SELECT DISTINCT field1 FROM table1" in SQL notation.

Thanks!

【Question discussion】:

Tags: r select unique dplyr

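【Solution 1】:

Just to add to the other answers: if you would prefer to get a vector back rather than a data frame, you have the following options: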

dplyr >= 0.7.0

Use the pull verb:

mtcars %>% distinct(cyl) %>% pull()

dplyr < 0.7.0

Wrap the dplyr pipeline in parentheses and combine it with the $ syntax:

(mtcars %>% distinct(cyl))$cyl
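
As a quick check that the two options above return the same thing, here is a minimal sketch. It assumes dplyr >= 0.7.0 is installed (for pull) and uses the built-in mtcars data; the names v_pull and v_dollar are only illustrative.

```r
library(dplyr)

# Naming the column in pull() makes it explicit which column is extracted
v_pull <- mtcars %>% distinct(cyl) %>% pull(cyl)

# Older style: wrap the pipeline in parentheses, then index with $
v_dollar <- (mtcars %>% distinct(cyl))$cyl

# Both are plain numeric vectors with the same values in the same order
identical(v_pull, v_dollar)
```

Passing the column name to pull() is optional here because the data frame has a single column, but it keeps the intent clear if more columns are added later.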

【Discussion】:

【Solution 2】:

As of dplyr 0.3, this can be done easily with the distinct() function.

Here is an example:

distinct_df = df %>% distinct(field1)

You can then get a vector of the distinct values with:

distinct_vector = distinct_df$field1

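You can also select a subset of columns at the same time as you perform the distinct() call, which can be cleaner to look at if you inspect the data frame with head/tail/glimpse: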

distinct_df = df %>% distinct(field1) %>% select(field1)
distinct_vector = distinct_df$field1

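【Discussion】:

This works if the data frame is already in R, but not if you are trying to run the query directly on a database through a db connection (i.e. src_postgres()). It reports: Error: Can't calculate distinct only on specified columns with SQL
See this question for how to connect src_postgres() and dplyr: stackoverflow.com/questions/21592266/…
Note that the way distinct() works changed in dplyr 0.5. By default, distinct() now returns only the columns passed as arguments to distinct(). If you want to keep the other columns, you must now pass .keep_all = TRUE as an additional argument to distinct().
Yes, dplyr 0.5 broke code I had written earlier with 0.3 and distinct. Why the change? The previous default behavior was useful and the natural thing to do.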
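
To illustrate the .keep_all behavior mentioned in the comments above, here is a small sketch. It assumes dplyr 0.5 or later and uses the built-in mtcars data; only_cyl and with_rest are illustrative names.

```r
library(dplyr)

# Default since dplyr 0.5: only the column named in distinct() is returned
only_cyl <- mtcars %>% distinct(cyl)

# .keep_all = TRUE keeps every column, using the first row found
# for each distinct value of cyl
with_rest <- mtcars %>% distinct(cyl, .keep_all = TRUE)

ncol(only_cyl)                      # a single column
ncol(with_rest) == ncol(mtcars)     # all original columns are kept
nrow(with_rest) == nrow(only_cyl)   # still one row per distinct cyl
```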

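【Solution 3】:

The dplyr select function selects specific columns from a data frame. To return the unique values in a particular column of data, you can use the group_by function. For example: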

library(dplyr)

# Fake data
set.seed(5)
dat = data.frame(x=sample(1:10,100, replace=TRUE))

# Return the distinct values of x
dat %>%
  group_by(x) %>%
  summarise() 

    x
1   1
2   2
3   3
4   4
5   5
6   6
7   7
8   8
9   9
10 10

If you want to change the column name, you can add the following:

dat %>%
  group_by(x) %>%
  summarise() %>%
  select(unique.x=x)

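This both selects column x from among all the columns in the data frame that dplyr returns (of course there is only one column in this case) and changes its name to unique.x.

You can also get the unique values directly in base R with unique(dat$x).

If you have multiple variables and want all the unique combinations that appear in the data, you can generalize the above code as follows: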
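
One difference worth knowing between the two approaches: unique() returns values in order of first appearance, while group_by() followed by summarise() returns them sorted by group. The sketch below compares them on the same fake data as above; it assumes dplyr >= 0.7.0 for pull, and base_vals and dplyr_vals are illustrative names.

```r
library(dplyr)

set.seed(5)
dat <- data.frame(x = sample(1:10, 100, replace = TRUE))

# Base R: unique values in order of first appearance
base_vals <- unique(dat$x)

# dplyr: one row per group, with groups in ascending order
dplyr_vals <- dat %>%
  group_by(x) %>%
  summarise() %>%
  pull(x)

# Same set of values; identical once the base R result is sorted
setequal(base_vals, dplyr_vals)
identical(sort(base_vals), dplyr_vals)
```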

set.seed(5)
dat = data.frame(x=sample(1:10,100, replace=TRUE), 
                 y=sample(letters[1:5], 100, replace=TRUE))

dat %>% 
  group_by(x,y) %>%
  summarise() %>%
  select(unique.x=x, unique.y=y)

【Discussion】:

Or use the new distinct() function in dplyr 0.3.
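
As a sketch of that suggestion, distinct() can be given several columns to return the unique combinations directly. The example below reuses the two-column fake data from the answer and checks that it finds the same combinations as the group_by() version; combos_distinct and combos_grouped are illustrative names.

```r
library(dplyr)

set.seed(5)
dat <- data.frame(x = sample(1:10, 100, replace = TRUE),
                  y = sample(letters[1:5], 100, replace = TRUE))

# Unique (x, y) combinations in one step
combos_distinct <- dat %>% distinct(x, y)

# The group_by()/summarise() version from the answer, ungrouped for comparison
combos_grouped <- dat %>%
  group_by(x, y) %>%
  summarise() %>%
  ungroup()

# Same number of combinations, and none missing from either side
nrow(combos_distinct) == nrow(combos_grouped)
nrow(anti_join(combos_distinct, combos_grouped, by = c("x", "y"))) == 0
```

The row order differs between the two results: distinct() keeps order of first appearance, while the grouped version is sorted by x and then y.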