【Question title】: Counts of combinations of values in a dataframe in R
【Posted】: 2018-08-20 01:24:17
【Question】:

I have a data frame like this:

    df<-structure(list(id = c("A", "A", "A", "B", "B", "C", "C", "D", 
"D", "E", "E"), expertise = c("r", "python", "julia", "python", 
"r", "python", "julia", "python", "julia", "r", "julia")), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -11L), .Names = c("id", 
"expertise"), spec = structure(list(cols = structure(list(id = structure(list(), class = c("collector_character", 
"collector")), expertise = structure(list(), class = c("collector_character", 
"collector"))), .Names = c("id", "expertise")), default = structure(list(), class = c("collector_guess", 
"collector"))), .Names = c("cols", "default"), class = "col_spec"))

df
   id expertise
1   A         r
2   A    python
3   A     julia
4   B    python
5   B         r
6   C    python
7   C     julia
8   D    python
9   D     julia
10  E         r
11  E     julia

I can get the overall count for each "expertise" value like this:

library(dplyr)
df %>% group_by(expertise) %>% mutate(counts_overall = n())

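However, what I want is the count of combinations of expertise values. In other words, how many "id"s share the same combination of two expertise values, e.g. "r" and "julia"? This is the desired output: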

df_out<-structure(list(expertise1 = c("r", "r", "python"), expertise2 = c("python", 
"julia", "julia"), count = c(2L, 2L, 3L)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -3L), .Names = c("expertise1", 
"expertise2", "count"), spec = structure(list(cols = structure(list(
    expertise1 = structure(list(), class = c("collector_character", 
    "collector")), expertise2 = structure(list(), class = c("collector_character", 
    "collector")), count = structure(list(), class = c("collector_integer", 
    "collector"))), .Names = c("expertise1", "expertise2", "count"
)), default = structure(list(), class = c("collector_guess", 
"collector"))), .Names = c("cols", "default"), class = "col_spec"))

df_out
  expertise1 expertise2 count
1          r     python     2
2          r      julia     2
3     python      julia     3

【Comments】:

  • I think the off-diagonal of crossprod(table(df) > 0) should do it.
  • @thelatemail Post it as an answer?
  • @RonakShah - looking for a duplicate, since I know I stole it from someone smarter than me!
  • To get the same desired output format, we can extend @thelatemail's answer: df_out <- crossprod(table(df) > 0) %>% melt(); colnames(df_out) <- c("exp1", "exp2", "count"); df_out %>% filter(exp1 != exp2, count > 0) %>% arrange(desc(count))

Tags: r dataframe combinations


【Solution 1】:

The approach from thelatemail's comment and the linked answer creates a matrix:

crossprod(table(df) > 0)
         expertise
expertise julia python r
   julia      4      3 2
   python     3      4 2
   r          2      2 3

whereas the OP expects a data frame in long format.
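For comparison, the matrix itself can also be reshaped into long format with base R alone. The following is a minimal sketch, not part of the original answer: it rebuilds the question's data as a plain data.frame, and the name pairs_long is made up for illustration. Taking only the upper triangle keeps each unordered pair once and drops the diagonal.

```r
# Example data from the question, rebuilt as a plain data.frame (assumption:
# the readr "spec" attribute of the original tibble is not needed here).
df <- data.frame(
  id = c("A", "A", "A", "B", "B", "C", "C", "D", "D", "E", "E"),
  expertise = c("r", "python", "julia", "python", "r", "python",
                "julia", "python", "julia", "r", "julia"),
  stringsAsFactors = FALSE
)

# id x expertise incidence table, then expertise x expertise co-occurrence counts.
m <- crossprod(table(df) > 0)

# Positions above the diagonal: each unordered pair appears exactly once.
idx <- which(upper.tri(m), arr.ind = TRUE)

# pairs_long is a hypothetical name for the long-format result.
pairs_long <- data.frame(
  expertise1 = rownames(m)[idx[, "row"]],
  expertise2 = colnames(m)[idx[, "col"]],
  count = m[idx],
  stringsAsFactors = FALSE
)

# Drop pairs that never occur together.
pairs_long <- pairs_long[pairs_long$count > 0, ]
print(pairs_long)
```

The diagonal of the matrix holds the number of ids per expertise value, which is why it is excluded here.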

1) Cross join

Below is a data.table solution which uses the CJ() (cross join) function:

library(data.table)
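# Name the CJ() columns explicitly: newer data.table versions name them
# after the inputs instead of V1 and V2.
setDT(df)[, CJ(V1 = expertise, V2 = expertise)[V1 < V2], by = id][
  , .N, by = .(expertise1 = V1, expertise2 = V2)]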
   expertise1 expertise2 N
1:      julia     python 3
2:      julia          r 2
3:     python          r 2

The CJ() cross join filtered with V1 < V2 is the data.table equivalent of t(combn(df$expertise, 2)) or combinat::combn2(df$expertise).
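To illustrate why filtering a cross join with V1 < V2 yields each unordered pair exactly once, here is a small base R sketch. It is an illustration only: expand.grid() stands in for CJ(), and the vector x holds the expertise values of id "A" from the question.

```r
# Expertise values of id "A" in the question.
x <- c("r", "python", "julia")

# All ordered pairs, including self-pairs (a, a) and both (a, b) and (b, a).
grid <- expand.grid(V1 = x, V2 = x, stringsAsFactors = FALSE)

# Keeping V1 < V2 drops the self-pairs and keeps one ordering of each pair.
filtered <- grid[grid$V1 < grid$V2, ]
filtered <- filtered[order(filtered$V1, filtered$V2), ]

# combn() on the sorted values enumerates the same pairs, one per row after t().
via_combn <- t(combn(sort(x), 2))

print(filtered)
print(via_combn)
```

Sorting before combn() matters for the comparison: combn() keeps the input order within each pair, whereas the V1 < V2 filter always puts the alphabetically smaller value first.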

2) Self-join

Here is another variant which uses a self-join:

library(data.table)
# allow.cartesian = TRUE permits the join to return more rows than either input
setDT(df)[df, on = "id", allow.cartesian = TRUE][
  expertise < i.expertise, .N, by = .(expertise1 = expertise, expertise2 = i.expertise)]
   expertise1 expertise2 N
1:     python          r 2
2:      julia          r 2
3:      julia     python 3
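The same self-join idea can be sketched without data.table, using merge() and aggregate() from base R. This is an assumption-laden illustration rather than the answerer's code: the data is rebuilt as a plain data.frame, and the names pairs and counts are hypothetical.

```r
# Example data from the question, rebuilt as a plain data.frame.
df <- data.frame(
  id = c("A", "A", "A", "B", "B", "C", "C", "D", "D", "E", "E"),
  expertise = c("r", "python", "julia", "python", "r", "python",
                "julia", "python", "julia", "r", "julia"),
  stringsAsFactors = FALSE
)

# Self-join on id: every pair of expertise values held by the same id.
# The suffixes turn the two "expertise" columns into expertise1 / expertise2.
pairs <- merge(df, df, by = "id", suffixes = c("1", "2"))

# Keep one ordering of each pair and drop self-pairs.
pairs <- pairs[pairs$expertise1 < pairs$expertise2, ]

# Count the ids per pair.
counts <- aggregate(id ~ expertise1 + expertise2, data = pairs, FUN = length)
names(counts)[names(counts) == "id"] <- "count"
print(counts)
```

Like the data.table variant, this materialises all within-id pairs before filtering, so it may become slow when single ids have many expertise values.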

【Comments】:

    【Solution 2】:

    A solution that is less efficient than the cross-tabulation approach but easy to understand:

    library(dplyr)  # group_by(), summarize(), n() and %>%
    library(tidyr)  # unnest(), separate()
    
    df %>% group_by(id) %>%
        summarize(expertise = list(combn(sort(expertise), 2, FUN = paste, collapse = '_'))) %>%
        unnest(expertise) %>%
        group_by(expertise) %>%
        summarize(count = n()) %>%
        separate(expertise, c('expertise1', 'expertise2'), sep = '_')
    
    # # A tibble: 3 x 3
    #   expertise1 expertise2 count
    #   <chr>      <chr>      <int>
    # 1 julia      python         3
    # 2 julia      r              2
    # 3 python     r              2
    

    【Comments】:
