【Title】: Extract data from a nested list, and convert to a tidy dataframe
【Posted】: 2019-07-24 19:19:37
【Question】:

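I have a data reformatting problem I could use some help with! I'm starting with a list, and I want to turn it into a "tidy" data frame that I can analyze further.

The structure of my list looks like this: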

str(wells, list.len = 3)  
    List of 96  
     $ A1 :List of 2  
      ..$ times : num [1:96] 0 900 1800 2700 3600 4500 5400 6300 7200 8100 ...  
      ..$ values: num [1:80] 0.0966 0.0928 0.0924 0.0931 0.0931 0.0939 0.0937 0.0938 0.0943 0.0949 ...  
      ..- attr(*, "name")= chr "A1"  
      ..- attr(*, "class")= chr "softermax.well"  
      ..- attr(*, "ID")= chr "1"  
      ..- attr(*, "row")= int 1  
      ..- attr(*, "col")= int 1  
     $ A2 :List of 2  
      ..$ times : num [1:96] 0 900 1800 2700 3600 4500 5400 6300 7200 8100 ...  
      ..$ values: num [1:80] 0.0945 0.0915 0.0912 0.0911 0.0913 0.0918 0.0921 0.0921 0.0923 0.0934 ...  
      ..- attr(*, "name")= chr "A2"  
      ..- attr(*, "class")= chr "softermax.well"  
      ..- attr(*, "ID")= chr "2"  
      ..- attr(*, "row")= int 1  
      ..- attr(*, "col")= int 2  
     $ A3 :List of 2  
      ..$ times : num [1:96] 0 900 1800 2700 3600 4500 5400 6300 7200 8100 ...  
      ..$ values: num [1:80] 0.0932 0.09 0.0898 0.0896 0.0898 0.0901 0.0903 0.0903 0.0911 0.0918 ...  
      ..- attr(*, "name")= chr "A3"  
      ..- attr(*, "class")= chr "softermax.well"  
      ..- attr(*, "ID")= chr "3"  
      ..- attr(*, "row")= int 1  
      ..- attr(*, "col")= int 3  

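I would like the resulting data frame to have three columns: "name", "time", and "value". "name" should be the "name" attribute of each top-level list entry. Each value of "name" should have 80 rows in the final data frame, with "time" and "value" being the first 80 entries of the "times" and "values" sub-lists. Entries 81 through 96 of "times" are NA and need to be removed so that "times" and "values" have the same length.

I've been playing around with purrr and map from the tidyverse. I can extract some of the pieces I want, but can't figure out how to put them all together.

I can get the list of names with: wellnames <- attributes(wells)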

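I can extract "times" and "values" for each sub-list using purrr::map, like this: x <- map(wells, `[`, c("times", "values")) but I can't convert the resulting list of lists to a data frame, because "times" and "values" have different lengths (96 and 80 respectively, because of the extra NA values at the end of "times").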

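I can extract the desired "times" values for the first sub-list: wells$A1$times[!is.na(wells$A1$times)] but I can't figure out how to use purrr's map function together with is.na to extract the desired "times" values for each of the 96 sub-lists.

If I could get "times" without the NA values, it should be fairly straightforward to turn the pieces into one or more data frames and reshape/combine them as needed with dplyr.

I know there must be a tidyverse solution to this problem; I just haven't quite figured out the syntax for handling the nesting and the NAs.

Here is the full data set for the first 3 sub-lists: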

dput(wells[1:3])  
structure(list(A1 = structure(list(times = c(0, 900, 1800, 2700, 
3600, 4500, 5400, 6300, 7200, 8100, 9000, 9900, 10800, 11700, 
12600, 13500, 14400, 15300, 16200, 17100, 18000, 18900, 19800, 
20700, 21600, 22500, 23400, 24300, 25200, 26100, 27000, 27900, 
28800, 29700, 30600, 31500, 32400, 33300, 34200, 35100, 36000, 
36900, 37800, 38700, 39600, 40500, 41400, 42300, 43200, 44100, 
45000, 45900, 46800, 47700, 48600, 49500, 50400, 51300, 52200, 
53100, 54000, 54900, 55800, 56700, 57600, 58500, 59400, 60300, 
61200, 62100, 63000, 63900, 64800, 65700, 66600, 67500, 68400, 
69300, 70200, 71100, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA), values = c(0.0966, 0.0928, 0.0924, 0.0931, 
0.0931, 0.0939, 0.0937, 0.0938, 0.0943, 0.0949, 0.0951, 0.096, 
0.0968, 0.098, 0.0991, 0.1004, 0.102, 0.1034, 0.1054, 0.1078, 
0.1103, 0.1132, 0.1161, 0.1196, 0.1234, 0.1279, 0.1329, 0.1381, 
0.144, 0.1505, 0.1574, 0.1648, 0.1732, 0.1819, 0.1912, 0.2018, 
0.2127, 0.232, 0.2436, 0.329, 0.4145, 0.3683, 0.4234, 0.5003, 
0.5291, 0.5463, 0.5472, 0.5664, 0.5649, 0.5618, 0.5487, 0.5494, 
0.5372, 0.5241, 0.4825, 0.5502, 0.544, 0.5415, 0.5319, 0.5234, 
0.5174, 0.5146, 0.5098, 0.4848, 0.3679, 0.3651, 0.3627, 0.3574, 
0.3686, 0.3577, 0.3689, 0.3528, 0.3584, 0.3573, 0.3471, 0.3571, 
0.3556, 0.3536, 0.3648, 0.3428)), .Names = c("times", "values"
), name = "A1", class = "softermax.well", ID = "1", row = 1L, col = 1L), 
    A2 = structure(list(times = c(0, 900, 1800, 2700, 3600, 4500, 
    5400, 6300, 7200, 8100, 9000, 9900, 10800, 11700, 12600, 
    13500, 14400, 15300, 16200, 17100, 18000, 18900, 19800, 20700, 
    21600, 22500, 23400, 24300, 25200, 26100, 27000, 27900, 28800, 
    29700, 30600, 31500, 32400, 33300, 34200, 35100, 36000, 36900, 
    37800, 38700, 39600, 40500, 41400, 42300, 43200, 44100, 45000, 
    45900, 46800, 47700, 48600, 49500, 50400, 51300, 52200, 53100, 
    54000, 54900, 55800, 56700, 57600, 58500, 59400, 60300, 61200, 
    62100, 63000, 63900, 64800, 65700, 66600, 67500, 68400, 69300, 
    70200, 71100, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA), values = c(0.0945, 0.0915, 0.0912, 0.0911, 
    0.0913, 0.0918, 0.0921, 0.0921, 0.0923, 0.0934, 0.094, 0.0949, 
    0.0958, 0.0965, 0.098, 0.0994, 0.101, 0.1028, 0.1054, 0.1079, 
    0.1108, 0.1138, 0.1173, 0.1219, 0.1261, 0.1313, 0.1366, 0.1431, 
    0.1497, 0.1572, 0.1657, 0.1742, 0.1846, 0.195, 0.2066, 0.2203, 
    0.2329, 0.2507, 0.3472, 0.3383, 0.2988, 0.5052, 0.5218, 0.5425, 
    0.4873, 0.45, 0.532, 0.5555, 0.5582, 0.5819, 0.5856, 0.5698, 
    0.5713, 0.5837, 0.5698, 0.5674, 0.5612, 0.562, 0.5605, 0.5498, 
    0.5597, 0.556, 0.5412, 0.5382, 0.5329, 0.5367, 0.5417, 0.525, 
    0.5205, 0.532, 0.5119, 0.5255, 0.5138, 0.523, 0.5128, 0.5227, 
    0.5114, 0.5244, 0.5193, 0.5089)), .Names = c("times", "values"
    ), name = "A2", class = "softermax.well", ID = "2", row = 1L, col = 2L), 
    A3 = structure(list(times = c(0, 900, 1800, 2700, 3600, 4500, 
    5400, 6300, 7200, 8100, 9000, 9900, 10800, 11700, 12600, 
    13500, 14400, 15300, 16200, 17100, 18000, 18900, 19800, 20700, 
    21600, 22500, 23400, 24300, 25200, 26100, 27000, 27900, 28800, 
    29700, 30600, 31500, 32400, 33300, 34200, 35100, 36000, 36900, 
    37800, 38700, 39600, 40500, 41400, 42300, 43200, 44100, 45000, 
    45900, 46800, 47700, 48600, 49500, 50400, 51300, 52200, 53100, 
    54000, 54900, 55800, 56700, 57600, 58500, 59400, 60300, 61200, 
    62100, 63000, 63900, 64800, 65700, 66600, 67500, 68400, 69300, 
    70200, 71100, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA), values = c(0.0932, 0.09, 0.0898, 0.0896, 
    0.0898, 0.0901, 0.0903, 0.0903, 0.0911, 0.0918, 0.0925, 0.0935, 
    0.0943, 0.0952, 0.0967, 0.0977, 0.1, 0.1018, 0.1041, 0.1067, 
    0.1092, 0.1156, 0.1151, 0.1193, 0.1238, 0.1284, 0.1334, 0.1402, 
    0.1464, 0.1533, 0.1614, 0.1698, 0.178, 0.1883, 0.1981, 0.2098, 
    0.2216, 0.2437, 0.3692, 0.4148, 0.4345, 0.4958, 0.5029, 0.4899, 
    0.5336, 0.5654, 0.547, 0.486, 0.5027, 0.5277, 0.4908, 0.5641, 
    0.5867, 0.5822, 0.5615, 0.5527, 0.5519, 0.5292, 0.3352, 0.3579, 
    0.3604, 0.3638, 0.366, 0.3787, 0.3737, 0.3645, 0.3674, 0.3794, 
    0.3589, 0.3981, 0.3361, 0.3508, 0.3217, 0.3196, 0.3176, 0.3645, 
    0.3532, 0.3528, 0.3267, 0.3473)), .Names = c("times", "values"
    ), name = "A3", class = "softermax.well", ID = "3", row = 1L, col = 3L)), .Names = c("A1", 
"A2", "A3"))

【Question comments】:

    Tags: r dataframe purrr


    【Solution 1】:

    You can do this with map_df, which automatically simplifies the result to a data frame (tibble).

    library(purrr)
    library(tibble)

    wells2 <- map_df(wells,
                     ~tibble(time = .$times[!is.na(.$times)],  # remove NAs so the lengths match
                             value = .$values),
                     .id = "name")                             # adds an id column holding the well name
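
The same idea can be tried on a small, self-contained example. The sketch below is only an illustration: toy_wells is made-up stand-in data (not the asker's wells), and it assumes purrr 1.0.0 or later, where map_df() is documented as superseded in favour of map() followed by list_rbind().

```r
library(purrr)
library(tibble)

# Toy stand-in for `wells` (hypothetical data): `times` is padded with
# trailing NAs, so it is longer than `values`, as in the question.
toy_wells <- list(
  A1 = list(times = c(0, 900, 1800, NA, NA), values = c(0.10, 0.11, 0.12)),
  A2 = list(times = c(0, 900, 1800, NA, NA), values = c(0.20, 0.21, 0.22))
)

# One tibble per well, with the NA padding dropped from `times`.
per_well <- map(toy_wells, function(w) {
  tibble(time = w$times[!is.na(w$times)], value = w$values)
})

# Stack the per-well tibbles; the list names become the "name" column.
tidy_wells <- list_rbind(per_well, names_to = "name")
print(tidy_wells)
```

With the real data, replacing toy_wells by wells should give 80 rows per well, provided every well has exactly as many non-NA times as values.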
    

    【Comments】:

      【Solution 2】:
      library(dplyr); library(purrr)
      
      wells %>% 
        map(~tibble(time = na.omit(.x$times), value = na.omit(.x$values))) %>% 
        bind_rows(.id = "name")
      

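      For each list element, create a tibble whose columns are taken from that element's times and values components.


      More generally, if you want to apply a function to nested elements, use map_depth: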

      wells %>% 
        map_depth(2, na.omit) %>% 
        map(as_tibble) %>% 
        bind_rows(.id = "name")
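
To see what the depth-2 mapping does, here is a minimal sketch on made-up data. toy_wells and the helper drop_na_vec are hypothetical names introduced for illustration; the helper is used in place of na.omit only because na.omit attaches an "na.action" attribute to the vectors it shortens, which this variant avoids carrying into the result.

```r
library(purrr)
library(tibble)
library(dplyr)

# Toy stand-in for `wells` (hypothetical data).
toy_wells <- list(
  A1 = list(times = c(0, 900, 1800, NA, NA), values = c(0.10, 0.11, 0.12)),
  A2 = list(times = c(0, 900, 1800, NA, NA), values = c(0.20, 0.21, 0.22))
)

# Hypothetical helper: drop NAs without adding any attributes.
drop_na_vec <- function(v) v[!is.na(v)]

tidy_wells <- toy_wells %>%
  map_depth(2, drop_na_vec) %>%   # depth 2 = each `times` / `values` vector
  map(as_tibble) %>%              # each well becomes a two-column tibble
  bind_rows(.id = "name")         # list names become the "name" column
print(tidy_wells)
```

Note that the columns keep their original names (times, values) here, since as_tibble simply uses the names of each well's components.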
      

      【Comments】:

        【Solution 3】:

        Another possibility using the tidyverse:

        library(tidyverse)
        
        enframe(na.omit(unlist(wells))) %>% 
          mutate(mrow = str_extract(name, '[[:digit:]]+$'),
                 mvar = gsub('A|\\.|[[:digit:]]+', '', name),
                 name = str_extract(name, '^A[[:digit:]]+')) %>% 
          spread(key = mvar, value = value) %>% 
          select(-mrow)
        
        #> # A tibble: 240 x 3
        #>    name  times values
        #>    <chr> <dbl>  <dbl>
        #>  1 A1        0 0.0966
        #>  2 A1     8100 0.0949
        #>  3 A1     9000 0.0951
        #>  4 A1     9900 0.096 
        #>  5 A1    10800 0.0968
        #>  6 A1    11700 0.098 
        #>  7 A1    12600 0.0991
        #>  8 A1    13500 0.100 
        #>  9 A1    14400 0.102 
        #> 10 A1    15300 0.103 
        #> # ... with 230 more rows
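
In the printed output the second row is time 8100 rather than 900. That is consistent with mrow being a character column, so the rows are ordered as "1", "10", "11", and so on. If measurement order matters, one possible variation is sketched below. It is an illustration on made-up data (toy_wells is hypothetical), it converts the index to an integer, and it swaps spread() for pivot_wider(), which is not what the answer itself uses.

```r
library(tibble)
library(dplyr)
library(tidyr)
library(stringr)

# Toy stand-in for `wells` (hypothetical data): 12 readings per well,
# with `times` padded by trailing NAs as in the question.
toy_wells <- list(
  A1 = list(times = c(seq(0, 9900, by = 900), NA, NA), values = 0.10 + (0:11) / 100),
  A2 = list(times = c(seq(0, 9900, by = 900), NA, NA), values = 0.20 + (0:11) / 100)
)

flat <- unlist(toy_wells)     # names look like "A1.times1", "A1.values12"
flat <- flat[!is.na(flat)]    # drop the NA padding

tidy_wells <- enframe(flat) %>%
  mutate(
    mrow = as.integer(str_extract(name, "[0-9]+$")),  # reading index, as a number
    mvar = str_extract(name, "[a-z]+"),               # "times" or "values"
    name = str_extract(name, "^[^.]+")                # well name, e.g. "A1"
  ) %>%
  pivot_wider(names_from = mvar, values_from = value) %>%
  arrange(match(name, names(toy_wells)), mrow) %>%    # well order, then reading order
  select(-mrow)
print(tidy_wells)
```

The regular expressions here assume well names contain no dots and no lowercase letters, which holds for names such as A1 but would need adjusting for other naming schemes.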
        

        【Comments】:

          【Solution 4】:

          What about data.table?

          wells <- data.table::rbindlist(
            lapply(wells, function(x) lapply(x, `[`, 1:80)),
            idcol = 'name'
          )
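
The 1:80 above hard-codes the number of readings. A hedged variation, shown on made-up data (toy_wells is hypothetical), takes the length from values instead. It assumes, as the question states, that the NAs in times are only trailing padding.

```r
library(data.table)

# Toy stand-in for `wells` (hypothetical data).
toy_wells <- list(
  A1 = list(times = c(0, 900, 1800, NA, NA), values = c(0.10, 0.11, 0.12)),
  A2 = list(times = c(0, 900, 1800, NA, NA), values = c(0.20, 0.21, 0.22))
)

tidy_wells <- rbindlist(
  lapply(toy_wells, function(w) {
    n <- length(w$values)        # number of real readings in this well
    lapply(w, `[`, seq_len(n))   # keep the first n entries of each component
  }),
  idcol = "name"                 # list names become the "name" column
)
print(tidy_wells)
```

Unlike the original snippet, this assigns to a new object rather than overwriting wells, so the source list stays available for re-running.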
          

          【Comments】:
