【Question Title】: Rectangling nested lists with different names/indices
【Posted】: 2021-04-26 18:00:48
【Question】:

My input comes from JSON data, and the names in the list are the keys of the JSON key/value pairs. So it looks like this:

# Dummy data
doc1 <- list(type = "HTML",
             garbage = "blahblah",
             `1 - 28` = list(food = "pizza",
                             birthdate = "12-31-89",
                             name = "Jill"),
             `3 - 36` = list(pet = "gerbil",
                             gender = "female"))
doc2 <- list(type = "XLS",
             `2 - 2` = list(hour = "now",
                            profession = "Engineer"),
             `3 - 36` = list(name = "Fred",
                             age = "36"))
input <- list(doc1 = doc1, doc2 = doc2)

I want to "rectangle" the data to make it easier to analyze, so that it looks like this:

# A tibble: 9 x 5
  doc   type  location column     value   
  <chr> <chr> <chr>    <chr>      <chr>   
1 doc1  HTML  1 - 28   food       pizza   
2 doc1  HTML  1 - 28   birthdate  12-31-89
3 doc1  HTML  1 - 28   name       Jill    
4 doc1  HTML  3 - 36   pet        gerbil  
5 doc1  HTML  3 - 36   gender     female  
6 doc2  XLS   2 - 2    hour       now     
7 doc2  XLS   2 - 2    profession Engineer
8 doc2  XLS   3 - 36   name       Fred    
9 doc2  XLS   3 - 36   age        36   

The complications are that:

  1. The nested entries have different indices, and sometimes there are columns I don't need at all (e.g., garbage)
  2. The lower-level nestings all have different names

I have a for-loop that iterates over the documents and extracts all the appropriate values, but it takes quite a while on large files. I found that the map functions from the purrr package can be used to extract certain columns (see this tutorial). But I can't seem to get map to work when I don't know the column names or indices.

# Work so far
input %>% {
  tibble(
    doc = names(.),
    type = map(., "type")
  )
} %>%
  unnest(cols = c(type))

I feel like this vignette holds the key.
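A name-agnostic sketch of the kind of thing I'm after (assuming only the dummy `input` above, where every nested entry holds scalar strings) would keep just the sub-lists of each doc and stack their key/value pairs, so no location or column names are ever hard-coded:

# Sketch: all names come from the data itself, none are hard-coded
library(tidyverse)

imap_dfr(input, function(doc, doc_name) {
  doc %>%
    keep(is.list) %>%    # keep nested location entries; drops type/garbage
    imap_dfr(function(loc, loc_name) {
      tibble(doc      = doc_name,
             type     = doc$type,
             location = loc_name,
             column   = names(loc),
             value    = unlist(loc, use.names = FALSE))
    })
})

But I'm not sure this is the idiomatic purrr/tidyr way to do it.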

【Comments】:

    Tags: r tidyr purrr


    【Solution 1】:

    This isn't the most general solution, but perhaps it will give you some ideas for your full dataset. The steps I used are:

    • data.frame to create a messy data frame from a single doc
    • drop the unwanted columns
    • pivot_longer, splitting the names on the last period
    • mutate the "..." string back into " - "
    library(tidyverse)
    
    # Function to clean a single doc.
    # data.frame() mangles names like `1 - 28.food` into "X1...28.food",
    # so we strip the leading "X", split on the last ".", and then
    # restore the "..." in the location back to " - ".
    doc_unnest <- function(doc, unwanted_cols){
      data.frame(doc) %>% 
        select(-contains(unwanted_cols)) %>%
        pivot_longer(-c(type),
                     names_prefix = 'X',
                     names_sep =  "\\.(?=[^\\.]+$)",  # split on the last "."
                     names_to = c('location', 'column')) %>%
        mutate(location = str_replace(location, '\\...', ' - '))
    }
    
    
    # Apply to both docs in input
    input %>%
      map_dfr(doc_unnest, .id = 'doc', unwanted_cols = 'garbage')
    
    
    #------------------
    # A tibble: 9 x 5
      doc   type  location column     value   
      <chr> <chr> <chr>    <chr>      <chr>   
    1 doc1  HTML  1 - 28   food       pizza   
    2 doc1  HTML  1 - 28   birthdate  12-31-89
    3 doc1  HTML  1 - 28   name       Jill    
    4 doc1  HTML  3 - 36   pet        gerbil  
    5 doc1  HTML  3 - 36   gender     female  
    6 doc2  XLS   2 - 2    hour       now     
    7 doc2  XLS   2 - 2    profession Engineer
    8 doc2  XLS   3 - 36   name       Fred    
    9 doc2  XLS   3 - 36   age        36  
    

    【Comments】:

      【Solution 2】:
      library(dplyr)
      library(tibblify)
      library(tidyr)
      
      input %>%
        tibblify %>%
        Reduce(unnest_longer, 3:7, .) %>%
        pivot_longer(3:7, names_to = "location", values_to = "column")
      

      This gives:

      # A tibble: 50 x 5
         type  garbage  `2 - 2_id` location  column
         <chr> <chr>    <chr>      <chr>     <chr> 
       1 HTML  blahblah <NA>       1 - 28    pizza 
       2 HTML  blahblah <NA>       1 - 28_id food  
       3 HTML  blahblah <NA>       3 - 36    gerbil
       4 HTML  blahblah <NA>       3 - 36_id pet   
       5 HTML  blahblah <NA>       2 - 2     <NA>  
       6 HTML  blahblah <NA>       1 - 28    pizza 
       7 HTML  blahblah <NA>       1 - 28_id food  
       8 HTML  blahblah <NA>       3 - 36    female
       9 HTML  blahblah <NA>       3 - 36_id gender
      10 HTML  blahblah <NA>       2 - 2     <NA>  
      # ... with 40 more rows
      

      【Comments】:
