【Question title】: Merge two columns with different structures in R
【Posted】: 2022-01-16 20:31:45
【Question】:

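How can I merge two tables that have different structures? I want to match the category names in "ds_categories" against the category IDs in "ds". ds has 1109383 rows and ds_categories has 2775 rows. More specifically, I want to link each category name to its category ID. The complete dataset is on Kaggle: https://www.kaggle.com/sp1thas/book-depository-dataset/code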

ds_categories

    category_id  category_name
1   1998         .Net Programming
2   176          20th Century & Contemporary Classical Music
3   3291         20th Century & Contemporary Classical Music
4   2659         20th Century History: C 1900 To C 2000
5   2661         21st Century History: From C 2000 -
6   1992         2D Graphics: Games Programming

ds

    authors       bestsellers.rank  categories
1   [1]           49848             [214, 220, 237, 2646, 2647, 2659, 2660, 2679]
2   [2, 3]        115215            [235, 3386]
3   [4]           11732             [358, 2630, 360, 2632]
4   [5, 6, 7, 8]  114379            [377, 2978, 2980]
5   [9]           98413             [2813, 2980]
6   [10, 11]      90674             [1520, 1532]

I tried this, but it did not work:

join_cat <- merge(ds, ds_categories, by.x = "categories", by.y = "category_id", all.x = TRUE, all.y = FALSE)

【Comments on the question】:

    Tags: r dplyr tidyverse


    【Solution 1】:

    You need to do some data cleaning first: put each value in categories on its own row, then perform the join.

    library(dplyr)
    library(tidyr)
    
    ds %>%
      mutate(categories = gsub('\\[|\\]', '', categories)) %>%
      separate_rows(categories, sep = ',\\s*', convert = TRUE) %>%
      left_join(ds_categories, by = c('categories' = 'category_id'))
    
    #   authors bestsellers.rank categories category_name                                      
    #   <chr>              <dbl>      <int> <chr>                                              
    # 1 [1]                49848        214 Biography: General                                 
    # 2 [1]                49848        220 Biography: Historical, Political & Military        
    # 3 [1]                49848        237 True War  & Combat Stories                         
    # 4 [1]                49848       2646 Asian History                                      
    # 5 [1]                49848       2647 Middle Eastern History                             
    # 6 [1]                49848       2659 20th Century History: C 1900  To C 2000            
    # 7 [1]                49848       2660 Postwar 20th Century History, From C 1945 To C 2000
    # 8 [1]                49848       2679 Military History                                   
    # 9 [2, 3]            115215        235 True Crime Biographies                             
    #10 [2, 3]            115215       3386 True Crime Books                  
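
For context on why the cleaning step is needed, here is a minimal sketch of what the original `merge()` attempt does. It uses small made-up tables (`toy_ds`, `toy_categories`, names chosen for illustration only) and assumes `categories` is stored as text, as in the data below. `merge()` compares each whole string such as `'[235, 3386]'` against the `category_id` values, so no key matches and `category_name` comes back as `NA` for every row.

```r
# Toy lookup table (illustrative; the real one comes from categories.csv)
toy_categories <- data.frame(
  category_id   = c(235L, 3386L),
  category_name = c("True Crime Biographies", "True Crime Books")
)

# `categories` holds one string per row, not individual ids
toy_ds <- data.frame(
  authors          = c("[1]", "[2, 3]"),
  bestsellers.rank = c(49848, 115215),
  categories       = c("[214, 220]", "[235, 3386]")
)

# Same call shape as in the question: keys are compared as whole values
join_cat <- merge(toy_ds, toy_categories,
                  by.x = "categories", by.y = "category_id",
                  all.x = TRUE, all.y = FALSE)

# No string equals an id, so every category_name is missing
all_unmatched <- all(is.na(join_cat$category_name))
print(all_unmatched)
```

Splitting the string into one id per row, as done above with `separate_rows()`, is what gives the join a key it can actually match.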
    

    Data

    ds_categories <- read.csv('categories.csv')
    ds <- data.frame(authors = c('[1]', '[2, 3]'), 
                          bestsellers.rank = c(49848, 115215), 
                          categories = c('[214, 220, 237, 2646, 2647, 2659, 2660, 2679]', 
                                         '[235, 3386]'))
    

    【Comments】:

    • It works, @ronakshah! Thank you!
    【Solution 2】:

    Once the rows that hold multiple categories are unnested, the join becomes much easier:

    library(tidyverse)
    
    # create some example data
    ds_categories <- tribble(
      ~category_id,  ~category_name,
      1, "cat A",
      2, "cat B",
      3, "cat C"
    )
    
    ds <- tribble(
      ~authors, ~categories,
      c(1,2), c(1,2),
      3, 1,
      4, c(1,2,3)
    )
    
    ds %>%
      unnest(authors) %>%
      unnest(categories) %>%
      rename(category_id = categories) %>%
      left_join(ds_categories)
    #> Joining, by = "category_id"
    #> # A tibble: 8 x 3
    #>   authors category_id category_name
    #>     <dbl>       <dbl> <chr>        
    #> 1       1           1 cat A        
    #> 2       1           2 cat B        
    #> 3       2           1 cat A        
    #> 4       2           2 cat B        
    #> 5       3           1 cat A        
    #> 6       4           1 cat A        
    #> 7       4           2 cat B        
    #> 8       4           3 cat C
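
The example above builds list-columns directly with `tribble()`. In the question's sample the ids appear as text such as `[235, 3386]`, so `unnest()` would need a list-column first. The following is a rough sketch of one way to do that conversion, assuming the column is character; `ds_text`, `ds_list` and `ds_long` are illustrative names, not part of the original data.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Illustrative input: ids stored as bracketed text
ds_text <- tibble(
  authors    = c("[1]", "[2, 3]"),
  categories = c("[214, 220]", "[235, 3386]")
)

# Pull out every run of digits and turn each row into an integer vector
ds_list <- ds_text %>%
  mutate(
    categories = lapply(str_extract_all(categories, "\\d+"), as.integer)
  )

# Now the column is a list-column, so unnest() gives one id per row
ds_long <- ds_list %>%
  unnest(categories) %>%
  rename(category_id = categories)

print(ds_long)
```

After this step `ds_long` can be joined to `ds_categories` on `category_id` in the same way as above, provided `category_id` is numeric in both tables.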
    

    Created on 2021-12-13 by the reprex package (v2.0.1)

    Always try to normalize your tables to 3NF, in other words make them tidy.
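
As a sketch of what a normalized layout could look like here, the book attributes and the book-to-category links can be kept in separate tables. The `book_id` key and the table names `books` and `book_categories` are assumptions made for illustration; they do not come from the question's data.

```r
library(dplyr)
library(tidyr)

# Hypothetical input: one row per book with a list-column of category ids
ds_books <- tibble(
  book_id          = 1:2,                 # assumed key, for illustration
  bestsellers.rank = c(49848, 115215),
  categories       = list(c(214L, 220L), c(235L, 3386L))
)

# Book table: one row per book, only attributes of the book itself
books <- ds_books %>%
  select(book_id, bestsellers.rank)

# Link table: one row per (book, category) pair
book_categories <- ds_books %>%
  select(book_id, categories) %>%
  unnest(categories) %>%
  rename(category_id = categories)

print(books)
print(book_categories)
```

With this layout, category names are attached by joining `book_categories` to `ds_categories` on `category_id`, and book attributes are not repeated once per category.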

    【Comments】:
