Answers: Merge two columns with different structures in R

【Question title】: Merge two columns with different structures in R
【Posted】: 2022-01-16 20:31:45
【Question description】:

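How can I merge two tables that have different structures? I want to match the category names in "ds_categories" against the category IDs in "ds". ds has 1109383 rows and ds_categories has 2775 rows. More specifically, I want to link each category ID to its category name. The complete database is on Kaggle: https://www.kaggle.com/sp1thas/book-depository-dataset/code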

ds_categories

  category_id category_name
1        1998 .Net Programming
2         176 20th Century & Contemporary Classical Music
3        3291 20th Century & Contemporary Classical Music
4        2659 20th Century History: C 1900 To C 2000
5        2661 21st Century History: From C 2000 -
6        1992 2D Graphics: Games Programming

ds

  authors      bestsellers.rank categories
1 [1]                     49848 [214, 220, 237, 2646, 2647, 2659, 2660, 2679]
2 [2, 3]                 115215 [235, 3386]
3 [4]                     11732 [358, 2630, 360, 2632]
4 [5, 6, 7, 8]           114379 [377, 2978, 2980]
5 [9]                     98413 [2813, 2980]
6 [10, 11]                90674 [1520, 1532]

I tried this, but it did not work:

join_cat <- merge(ds, ds_categories, by.x = "categories", by.y = "category_id", all.x = TRUE, all.y = FALSE)

【Comments on the question】:

Tags: r dplyr tidyverse

【Solution 1】:

You first need to clean the data so that each value in categories sits on its own row, and then perform the join.

library(dplyr)
library(tidyr)

ds %>%
  mutate(categories = gsub('\\[|\\]', '', categories)) %>%
  separate_rows(categories, sep = ',\\s*', convert = TRUE) %>%
  left_join(ds_categories, by = c('categories' = 'category_id'))

#   authors bestsellers.rank categories category_name                                      
#   <chr>              <dbl>      <int> <chr>                                              
# 1 [1]                49848        214 Biography: General                                 
# 2 [1]                49848        220 Biography: Historical, Political & Military        
# 3 [1]                49848        237 True War  & Combat Stories                         
# 4 [1]                49848       2646 Asian History                                      
# 5 [1]                49848       2647 Middle Eastern History                             
# 6 [1]                49848       2659 20th Century History: C 1900  To C 2000            
# 7 [1]                49848       2660 Postwar 20th Century History, From C 1945 To C 2000
# 8 [1]                49848       2679 Military History                                   
# 9 [2, 3]            115215        235 True Crime Biographies                             
#10 [2, 3]            115215       3386 True Crime Books

Data

ds_categories <- read.csv('categories.csv')
ds <- data.frame(authors = c('[1]', '[2, 3]'), 
                      bestsellers.rank = c(49848, 115215), 
                      categories = c('[214, 220, 237, 2646, 2647, 2659, 2660, 2679]', 
                                     '[235, 3386]'))
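
For readers without categories.csv at hand, the sketch below may help show why the merge() call from the question finds no matches, and how to list IDs that have no name after the clean-up. It is only an illustration under assumptions: ds_categories here is a three-row stand-in built from names that appear in the output above, and the ID 9999 is made up to represent a category missing from the lookup table.

```r
library(dplyr)
library(tidyr)

# Stand-in for categories.csv (assumption: three rows taken from the output above)
ds_categories <- data.frame(
  category_id   = c(214L, 235L, 3386L),
  category_name = c("Biography: General", "True Crime Biographies", "True Crime Books")
)

# 9999 is a made-up ID that has no row in ds_categories
ds <- data.frame(
  authors          = c("[1]", "[2, 3]"),
  bestsellers.rank = c(49848, 115215),
  categories       = c("[214, 9999]", "[235, 3386]")
)

# The attempt from the question: the key is a whole string such as "[235, 3386]",
# which never equals a single integer ID, so every category_name comes back NA
naive <- merge(ds, ds_categories, by.x = "categories", by.y = "category_id",
               all.x = TRUE, all.y = FALSE)
all(is.na(naive$category_name))

# Same clean-up as in the answer: strip the brackets, one ID per row, convert to integer
long <- ds %>%
  mutate(categories = gsub("\\[|\\]", "", categories)) %>%
  separate_rows(categories, sep = ",\\s*", convert = TRUE)

# Rows of ds whose category ID is absent from ds_categories
unmatched <- anti_join(long, ds_categories, by = c("categories" = "category_id"))
unmatched$categories
```

Running anti_join() before the left_join() is one way to check how many IDs would end up with an NA name in the full data.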

【Comments】:

It works, @ronakshah! Thank you!

【Solution 2】:

Once the rows that relate to several categories are unnested, the join becomes much easier:

library(tidyverse)

# create some example data
ds_categories <- tribble(
  ~category_id,  ~category_name,
  1, "cat A",
  2, "cat B",
  3, "cat C"
)

ds <- tribble(
  ~authors, ~categories,
  c(1,2), c(1,2),
  3, 1,
  4, c(1,2,3)
)

ds %>%
  unnest(authors) %>%
  unnest(categories) %>%
  rename(category_id = categories) %>%
  left_join(ds_categories)
#> Joining, by = "category_id"
#> # A tibble: 8 x 3
#>   authors category_id category_name
#>     <dbl>       <dbl> <chr>        
#> 1       1           1 cat A        
#> 2       1           2 cat B        
#> 3       2           1 cat A        
#> 4       2           2 cat B        
#> 5       3           1 cat A        
#> 6       4           1 cat A        
#> 7       4           2 cat B        
#> 8       4           3 cat C

Created on 2021-12-13 by the reprex package (v2.0.1)

Try to always normalize your tables to 3NF, that is, make them tidy.
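
As a rough illustration of why the one-category-per-row form is convenient, the sketch below counts rows per category and then collapses the names back to one row per book. It is an assumption-laden example, not part of the original answer: the book column is a hypothetical identifier added here, and the category names are the toy values from the example data above.

```r
library(dplyr)
library(tidyr)

ds_categories <- tribble(
  ~category_id, ~category_name,
  1, "cat A",
  2, "cat B",
  3, "cat C"
)

# `book` is a hypothetical row identifier, not a column of the original data
ds <- tribble(
  ~book, ~categories,
  "b1", c(1, 2),
  "b2", 1,
  "b3", c(1, 2, 3)
)

# One category per row, then attach the names
tidy <- ds %>%
  unnest(categories) %>%
  left_join(ds_categories, by = c("categories" = "category_id"))

# Number of books in each category
per_category <- count(tidy, category_name)

# Back to one row per book, with the names collapsed into a single string
per_book <- tidy %>%
  group_by(book) %>%
  summarise(category_names = paste(category_name, collapse = "; "))
```

Both summaries are a single step on the long table, whereas the bracketed strings in the original column would have to be parsed again for each of them.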

【Comments】: