【Question title】: Creating an igraph graph from dplyr grouped data
【Posted】: 2023-10-29 16:08:01
【Question description】:

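My goal is to create an igraph graph object that I can later plot with ggraph.

My tidy data consists of invoices with varying numbers of items. n is the number of times exactly that invoice occurred in the original sample. For example, invoice type 1 below, which contains bread, butter and eggs, was issued 10 times.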

library(tidyverse)
data <- tibble(invoicetype = c(1,1,1,2,2,3,3,4,4,4,4,4,5,5,6,7,7,8,8,8,9,9), 
               item = c("bread", "butter", "eggs", "bread", "coke", "coke", "eggs", 
                        "bread", "butter","coke", "pasta", "water", "coke", "water", 
                        "coke", "bread", "butter", "eggs", "coke", "water", "pasta", 
                        "bread"),
               n = c(10,10,10,8,8,7,7,4,4,4,4,4,3,3,3,2,2,1,1,1,1,1))

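I want to create an igraph object that accounts for how many times each item appears on the same invoice together with any other item.

Question: Is there a simple way to do this?

My tedious solution:

Below is the solution I came up with, but it is not elegant and does not scale to my actual (large) data.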

data_spreaded <- data %>% group_by(invoicetype, n) %>% 
  summarise(item1 = item[1], item2 = item[2], item3 = item[3], 
            item4 = item[4], item5 = item[5])

combinations <- tibble()
for (g in 1:nrow(data_spreaded)) {
  for (i in 3:ncol(data_spreaded)) {
    for (j in 3:ncol(data_spreaded)) {
      if (i == j) { next }
      combinations <- 
        bind_rows(combinations,
                  tibble(from = data_spreaded[g,i] %>% pull(),
                         to = data_spreaded[g,j] %>% pull(),
                         invoicetype = data_spreaded[g,1] %>% pull(),
                         n = data_spreaded[g,2]%>% pull()))
    }
  }
}

combinations <- combinations %>% 
  distinct() %>% # remove the double counted
  filter(!is.na(from), !is.na(to)) %>% # remove empty combinations
  group_by(from, to) %>% 
  summarise(n = sum(n)) %>% 
  ungroup()

library(igraph)
g <- graph_from_data_frame(combinations, directed = FALSE)

To plot with ggraph, I use:

E(g)$weight <- combinations$n

library(ggraph)
set.seed(123)
ggraph(g, layout = "with_kk") + 
  geom_node_point() + 
  geom_node_text(aes(label = name), repel = TRUE) +
  geom_edge_link(aes(color = weight, label = n))

【Question discussion】:

    Tags: r dplyr igraph


    【Solution 1】:

    You can save a lot of time if you simply join the data to itself. Many edge-list workflows follow this pattern:

    combo <- data %>%
      #join the data to itself
      left_join(data, by = c('invoicetype', 'n')) %>%
      #this is undirected so x %--% y is the same as y %--% x
      filter(item.x < item.y) %>%
      group_by(item.x, item.y) %>%
      summarize(n = sum(n))
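
To see what the self-join does, here is a minimal sketch on a hypothetical toy data set (`toy` and `toy_edges` are made-up names, not part of the question). It uses `inner_join` instead of `left_join`, which should give the same rows here because every row matches at least itself. Recent dplyr versions (1.1.0 and later) warn about the many-to-many match; passing `relationship = "many-to-many"` to the join should silence that.

```r
library(dplyr)

# Hypothetical toy data: invoice type 1 was issued 5 times, type 2 three times
toy <- tibble(invoicetype = c(1, 1, 1, 2, 2),
              item = c("bread", "butter", "eggs", "bread", "butter"),
              n = c(5, 5, 5, 3, 3))

toy_edges <- toy %>%
  # pair every item with every item on the same invoice type
  inner_join(toy, by = c("invoicetype", "n")) %>%
  # keep one orientation of each pair and drop self-pairs
  filter(item.x < item.y) %>%
  group_by(item.x, item.y) %>%
  # add up the invoice counts per pair
  summarise(n = sum(n), .groups = "drop")

toy_edges
```

The bread/butter pair appears on both invoice types, so its count is 5 + 3 = 8; the other two pairs only occur on invoice type 1 and keep a count of 5.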
    

    The plot is made like this:

    g <- graph_from_data_frame(combo, directed = FALSE)
    
    g_strength <- strength(g, weights = E(g)$n)
    
    set.seed(1234)
    plot(g,
         edge.width = E(g)$n/max(E(g)$n) * 10,
         vertex.size = g_strength/max(g_strength) * 20)
    

    Hope this helps.

    【Discussion】:

      【Solution 2】:

      I usually tailor something like this for similar situations.

      library(tidyverse)
      
      data <- tibble(invoicetype = c(1,1,1,2,2,3,3,4,4,4,4,4,5,5,6,7,7,8,8,8,9,9), 
                     item = c("bread", "butter", "eggs", "bread", "coke", "coke", "eggs", 
                              "bread", "butter","coke", "pasta", "water", "coke", "water", 
                              "coke", "bread", "butter", "eggs", "coke", "water", "pasta", 
                              "bread"),
                     n = c(10,10,10,8,8,7,7,4,4,4,4,4,3,3,3,2,2,1,1,1,1,1))
      
      
      data %>% 
        mutate(item2 = item) %>%                      # make a second item column
        group_by(invoicetype) %>%                     
        expand(item, item2, nesting(n)) %>%           # get all in-group combinations
        ungroup() %>%
        filter(item != item2) %>%                     # drop loops
        mutate(from = map2_chr(item, item2, min),     # for undirected, sort dyad's names...
               to = map2_chr(item, item2, max)) %>%   # ... alphabetically
        distinct(invoicetype, from, to, n) %>%        # drop mirrored duplicates within each invoice type
        group_by(from, to) %>% 
        summarise(weight = sum(n)) %>%
        ungroup()
      
      #> # A tibble: 14 x 3
      #>    from   to     weight
      #>    <chr>  <chr>   <dbl>
      #>  1 bread  butter     16
      #>  2 bread  coke       12
      #>  3 bread  eggs       10
      #>  4 bread  pasta       5
      #>  5 bread  water       4
      #>  6 butter coke        4
      #>  7 butter eggs       10
      #>  8 butter pasta       4
      #>  9 butter water       4
      #> 10 coke   eggs        8
      #> 11 coke   pasta       4
      #> 12 coke   water       8
      #> 13 eggs   water       1
      #> 14 pasta  water       4
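
Since the goal of the question is an igraph object, a possible next step is sketched below. It assumes the pipeline result was stored in a data frame, here called `edges` (a hypothetical name); to stay self-contained, the sketch retypes only the first three rows of the output above. Because the third column is named `weight`, `graph_from_data_frame()` stores it as the edge attribute `weight`, which igraph treats as the edge weights.

```r
library(igraph)

# Hypothetical stand-in for the pipeline result: first three rows of the output
edges <- data.frame(from   = c("bread", "bread", "bread"),
                    to     = c("butter", "coke", "eggs"),
                    weight = c(16, 12, 10))

# Extra columns beyond the first two become edge attributes
g <- graph_from_data_frame(edges, directed = FALSE)

is_weighted(g)   # TRUE, because a "weight" edge attribute exists
strength(g)      # weighted degree; uses the weight attribute by default
```

With these three edges, the strength of bread is 16 + 12 + 10 = 38.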
      

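      【Discussion】:

      • I like the expand approach you used. It is a nice alternative to left_joining.
      • I think using expand is cleaner, and I usually make a separate dyad list column for sorting. That said, if the data is large, left_joining will be noticeably faster.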
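
On the speed point, one thing that might help with large data is ordering each dyad's names with base R's vectorised `pmin()` and `pmax()` rather than calling `min` and `max` row by row through `map2_chr()`. This is a swapped-in alternative, not what the answer above uses, and the vectors below are made-up examples.

```r
# Hypothetical item pairs, in arbitrary order
item  <- c("eggs", "bread", "water")
item2 <- c("bread", "coke", "pasta")

# Element-wise minimum and maximum also work on character vectors,
# so each pair ends up ordered alphabetically
from <- pmin(item, item2)
to   <- pmax(item, item2)

data.frame(from, to)
```

Inside a dplyr pipeline this would correspond to `mutate(from = pmin(item, item2), to = pmax(item, item2))`.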