Counting the number of unique elements in each column with dplyr in sparklyr: answers

【Question title】: Count number of unique elements in each column with dplyr in sparklyr
【Posted】: 2026-01-28 20:40:01
【Question】:

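I am trying to count the number of unique elements in each column of the Spark dataset s.

However, Spark does not seem to recognize tally():

k <- collect(s %>% group_by(grouping_type) %>% summarise_each(funs(tally(distinct(.)))))
Error: org.apache.spark.sql.AnalysisException: undefined function TALLY

Spark also does not seem to recognize simple R functions such as "unique" or "length". The code runs on local data, but exactly the same code fails when I run it on a Spark table.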

```
library(dplyr)

# Local data: works as expected
d <- data.frame(cbind(seq(1, 10, 1), rep(1, 10)))
d$group <- rep(c("a", "b"), each = 5)
d %>% group_by(group) %>% summarise_each(funs(length(unique(.))))
# A tibble: 2 × 3
#   group    X1    X2
#   <chr> <int> <int>
# 1     a     5     1
# 2     b     5     1

# Spark table s: the same code fails
k <- collect(s %>% group_by(grouping_type) %>% summarise_each(funs(length(unique(.)))))
# Error: org.apache.spark.sql.AnalysisException: undefined function UNIQUE;
```

【Comments】:

Possible duplicate of "number of unique values sparklyr".

Tags: r apache-spark statistics dplyr sparklyr

【Solution 1】:

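library(sparklyr)
library(dplyr)
# I am on Spark 2.1

# Assumption: a local connection; the original post does not show how sc was created
sc <- spark_connect(master = "local")

# Build the example input (local)
d <- data.frame(cbind(seq(1, 10, 1), rep(1, 10)))
d$group <- rep(c("a", "b"), each = 5)
d

# Copy it to Spark as a Spark tbl
sdf <- sparklyr::sdf_copy_to(sc, d)

# The answer
# (funs() is deprecated as of dplyr 0.8.0; on current dplyr,
#  summarise_all(n_distinct) is the replacement spelling)
sdf %>%
    group_by(group) %>%
    summarise_all(funs(n_distinct)) %>%
    collect()

# Output
#   group    X1    X2
#   <chr> <dbl> <dbl>
# 1     b     5     1
# 2     a     5     1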

Note: since we are working with sparklyr, I chose dplyr::n_distinct(). Minor point: dplyr::summarise_each is deprecated, hence dplyr::summarise_all.
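As a follow-up to the deprecation point, and not part of the original answer: in dplyr 1.0.0 and later the scoped verbs such as summarise_all are themselves superseded by across(). The sketch below runs on the local data frame d only. Whether the same pipeline works unchanged on a Spark tbl depends on the installed sparklyr and dbplyr versions, so treat that as an assumption to verify.

```r
library(dplyr)

# Same example input as in the answer (local data frame)
d <- data.frame(cbind(seq(1, 10, 1), rep(1, 10)))
d$group <- rep(c("a", "b"), each = 5)

# across(everything(), ...) applies n_distinct to every non-grouping column
res <- d %>%
  group_by(group) %>%
  summarise(across(everything(), n_distinct))

res
```

Each group has 5 distinct values in X1 and 1 distinct value in X2, matching the output shown in the answer.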

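【Comments】:

summarise_each is indeed deprecated; summarise_all has been preferred since dplyr 0.5.0.
@Zafar: Thanks. My code was already correct, but I had swapped the two in the closing note. Edited now.
Thank you both! I really appreciate it!
@StatsBoy Please consider accepting one of the answers.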

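【Solution 2】:

Keep in mind that when you write sparklyr code, you are actually transpiling to Spark SQL, so from time to time you may need to use Spark SQL verbs. This is one of those cases where Spark SQL verbs such as count and distinct come in handy.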

library(sparklyr)
library(dplyr)

# Assumption: a local master; the original called spark_connect() without arguments
sc <- spark_connect(master = "local")
iris_spk <- copy_to(sc, iris)

# for instance, this does not work in R, but it does in sparklyr
iris_spk %>%
  summarise(Species = distinct(Species))
# or
iris_spk %>%
  summarise(Species = approx_count_distinct(Species))

# this does what you are looking for
# (the column is Species, capitalized as in iris)
iris_spk %>%
    group_by(Species) %>%
    summarise_all(funs(n_distinct))

# for larger data sets this is much faster, at the cost of an approximate count
iris_spk %>%
    group_by(Species) %>%
    summarise_all(funs(approx_count_distinct))
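To see the transpiling described in this answer without a cluster, one can render the SQL that the dplyr code turns into. This is a minimal sketch, not part of the original answer. It assumes the dbplyr package (which sparklyr builds on) and uses lazy_frame(), a simulated generic SQL table, so the SQL a real Spark connection emits may differ in detail. The column names are hypothetical stand-ins for the copied iris table.

```r
library(dplyr)
library(dbplyr)

# Hypothetical stand-in for iris_spk: a simulated table, no Spark involved
lf <- lazy_frame(Sepal_Length = 1, Species = "setosa")

# n_distinct() is a function dbplyr knows how to translate
exact <- lf %>%
  group_by(Species) %>%
  summarise(n = n_distinct(Sepal_Length))

# approx_count_distinct() is unknown to dbplyr, so it is passed through
# to the database, where Spark SQL provides a function of that name
approx <- lf %>%
  group_by(Species) %>%
  summarise(n = approx_count_distinct(Sepal_Length))

exact_sql <- as.character(sql_render(exact))
approx_sql <- as.character(sql_render(approx))

cat(exact_sql, "\n\n", approx_sql, "\n")
```

The same pass-through is why the question's code failed: functions dbplyr cannot translate, such as unique, are sent to Spark as-is, and Spark SQL has no function with that name.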

【Comments】:

Thanks Zafar! I appreciate it!
As Pasqui said, marking one of them as the best answer would be a good idea :)