【Question title】: Top N Values of Cosine Similarity Matrix in R
【Posted】: 2021-06-30 17:48:04
【Question】:

How can I get the top pairs from the following cosine similarity matrix?

southpark_matrix <- structure(c(0, 0.165272735625452, 0.386480286121192, 0.170696960480773, 
0.0869562860988618, 0.165272735625452, 0, 0.251690602341816, 
0.472701602991984, 0.137486001150133, 0.386480286121192, 0.251690602341816, 
0, 0.255849200006255, 0.0972813221214626, 0.170696960480773, 
0.472701602991984, 0.255849200006255, 0, 0.156449701347234, 0.0869562860988618, 
0.137486001150133, 0.0972813221214626, 0.156449701347234, 0), .Dim = c(5L, 
5L), .Dimnames = list(Docs = c("Mr. Garrison_2", "Cartman_3", 
"Mr. Garrison_3", "Cartman_4", "Jimbo_5"), Docs = c("Mr. Garrison_2", 
"Cartman_3", "Mr. Garrison_3", "Cartman_4", "Jimbo_5")))

southpark_matrix

                Docs
Docs             Mr. Garrison_2 Cartman_3 Mr. Garrison_3 Cartman_4    Jimbo_5
  Mr. Garrison_2     0.00000000 0.1652727     0.38648029 0.1706970 0.08695629
  Cartman_3          0.16527274 0.0000000     0.25169060 0.4727016 0.13748600
  Mr. Garrison_3     0.38648029 0.2516906     0.00000000 0.2558492 0.09728132
  Cartman_4          0.17069696 0.4727016     0.25584920 0.0000000 0.15644970
  Jimbo_5            0.08695629 0.1374860     0.09728132 0.1564497 0.00000000

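How can I get the top 2 pairs?

In this example, the top 2 pairs would be the ones shown below. In my real data, I have more than 100 rows and columns.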

Cartman_3 Cartman_4             0.4727016
Mr. Garrison_2 Mr. Garrison_3   0.38648029

【Comments】:

    Tags: r matrix cosine-similarity


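    【Solution 1】:

    My approach is to convert the matrix to a tibble. We can turn the upper triangular part of the matrix into a 3-column data frame (row, column, value) by following the steps in: Convert upper triangular part of a matrix to 3-column long format.

    After that, we can simply call top_n(2, val), which keeps the 2 rows with the largest val. Alternatively, sort the values in descending order with arrange(desc(val)) and take the first 2 rows with head(2). Note that top_n() leaves the rows in their original order, whereas the second approach returns them sorted.

    I have produced a reprex of my approach below.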

    library(tidyverse)
    
    southpark_matrix <- structure(c(0, 0.165272735625452, 0.386480286121192, 0.170696960480773, 
      0.0869562860988618, 0.165272735625452, 0, 0.251690602341816, 
      0.472701602991984, 0.137486001150133, 0.386480286121192, 0.251690602341816, 
      0, 0.255849200006255, 0.0972813221214626, 0.170696960480773, 
      0.472701602991984, 0.255849200006255, 0, 0.156449701347234, 0.0869562860988618, 
      0.137486001150133, 0.0972813221214626, 0.156449701347234, 0), .Dim = c(5L, 5L),
      .Dimnames = list(
        Docs = c("Mr. Garrison_2", "Cartman_3", "Mr. Garrison_3", "Cartman_4", "Jimbo_5"),
        Docs = c("Mr. Garrison_2", "Cartman_3", "Mr. Garrison_3", "Cartman_4", "Jimbo_5")))
    
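    # Convert the upper triangle of the matrix to long format (row, col, val).
    # diag = TRUE also keeps the self-pairs; they are 0 here, so they never rank in the top N.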
    ind <- which(upper.tri(southpark_matrix, diag = TRUE), arr.ind = TRUE)
    dimnam <- dimnames(southpark_matrix)
    df <- data.frame(row = dimnam[[1]][ind[, 1]],
               col = dimnam[[2]][ind[, 2]],
               val = southpark_matrix[ind])
    # top_n method (rows stay in their original order)
    df %>%
      tibble() %>% 
      top_n(2, val)
    #> # A tibble: 2 x 3
    #>   row            col              val
    #>   <chr>          <chr>          <dbl>
    #> 1 Mr. Garrison_2 Mr. Garrison_3 0.386
    #> 2 Cartman_3      Cartman_4      0.473
    
    # arrange and head method (rows sorted by val, descending)
    df %>% 
      tibble() %>% 
      arrange(desc(val)) %>% 
      head(2)
    #> # A tibble: 2 x 3
    #>   row            col              val
    #>   <chr>          <chr>          <dbl>
    #> 1 Cartman_3      Cartman_4      0.473
    #> 2 Mr. Garrison_2 Mr. Garrison_3 0.386
    

    Created on 2021-04-04 by the reprex package (v2.0.0)
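
As a hedged aside: in dplyr 1.0.0 and later, top_n() is superseded by slice_max(). The sketch below shows the same idea with that function. The small matrix `m` and its values are made up for illustration, and the sketch calls upper.tri() without the diagonal so that self-pairs are excluded, which differs slightly from the reprex above.

```r
library(dplyr)

# Hypothetical 3 x 3 symmetric similarity matrix with a zero diagonal
m <- matrix(c(0,   0.2, 0.5,
              0.2, 0,   0.1,
              0.5, 0.1, 0),
            nrow = 3,
            dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

# Positions of the strict upper triangle: each unordered pair appears once
ind <- which(upper.tri(m), arr.ind = TRUE)

# Long format: one row per pair
pairs <- data.frame(row = rownames(m)[ind[, 1]],
                    col = colnames(m)[ind[, 2]],
                    val = m[ind],
                    stringsAsFactors = FALSE)

# Keep the 2 rows with the largest val; with_ties = FALSE caps the result at n rows
top2 <- pairs %>% slice_max(val, n = 2, with_ties = FALSE)
print(top2)
```

With the default with_ties = TRUE, slice_max() may return more than n rows when several pairs share the cut-off value, so pick whichever behaviour suits the data.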

    【Comments】:

      【Solution 2】:

      Using lapply:

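      N <- 2
      # N largest values from the strict upper triangle (each pair counted once)
      best <- head(sort(southpark_matrix[upper.tri(southpark_matrix)], decreasing = TRUE), N)
      # For each value, look up the row names of the cells that hold it
      lapply(best, function(x) {
        list(similarity = x, names = rownames(which(southpark_matrix == x, arr.ind = TRUE)))
      })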
      #> [[1]]
      #> [[1]]$similarity
      #> [1] 0.4727016
      #> 
      #> [[1]]$names
      #> [1] "Cartman_4" "Cartman_3"
      #> 
      #> 
      #> [[2]]
      #> [[2]]$similarity
      #> [1] 0.3864803
      #> 
      #> [[2]]$names
      #> [1] "Mr. Garrison_3" "Mr. Garrison_2"
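
One caveat with looking pairs up by value: if two different pairs happen to share the same similarity, which(southpark_matrix == x) matches all of their cells at once. A possible way around this, sketched below in base R, is to order the upper-triangle positions directly instead of matching on the value. The helper name `top_pairs` and the tiny matrix `m` are hypothetical and only serve the illustration.

```r
# Hypothetical helper: top n pairs of a symmetric similarity matrix
top_pairs <- function(m, n = 2) {
  # Positions of the strict upper triangle: each unordered pair once, no diagonal
  ind  <- which(upper.tri(m), arr.ind = TRUE)
  vals <- m[ind]
  # Indices of the n largest values; tied values keep their own positions
  keep <- head(order(vals, decreasing = TRUE), n)
  data.frame(row = rownames(m)[ind[keep, 1]],
             col = colnames(m)[ind[keep, 2]],
             val = vals[keep],
             stringsAsFactors = FALSE)
}

# Made-up matrix in which two pairs tie at 0.5
m <- matrix(c(0,   0.5, 0.5,
              0.5, 0,   0.1,
              0.5, 0.1, 0),
            nrow = 3,
            dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

res <- top_pairs(m, 2)
print(res)
```

Because each row of the result comes from one matrix position, tied values stay attached to their own pair of names. For a matrix with 100+ rows this still only builds one long vector of the upper triangle, so it should scale comfortably.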
      

      【Comments】:
