【Question title】: How to 'Unstack' list in R when some of the entries in the list are not equally repeating like others in list?
【Posted】: 2018-01-23 19:19:43
【Question】:

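This question is a follow-up to an earlier one (Filter values from list in R). I have a long list similar to the one shown below. One of the names in the list, "issues.fields.customfield_10400", repeats fewer times than all the other names. Checking whether a value exists for this name is one of the tasks I need to handle. NULL values are perfectly fine.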

DF = structure(list(name = structure(c(7L, 3L, 1L, 6L, 4L, 2L, 5L, 
                                      7L, 3L, 1L, 6L, 4L, 2L, 5L, 7L, 3L, 1L, 6L, 4L, 5L, 7L, 3L, 1L, 
                                      6L, 4L, 5L), .Label = c("issues.fields.created", "issues.fields.customfield_10400", 
                                                              "issues.fields.issuetype.name", "issues.fields.status.name", 
                                                              "issues.fields.summary", "issues.fields.updated", "issues.key"
                                      ), class = "factor"), value = structure(c(18L, 13L, 4L, 4L, 11L, 
                                                                                7L, 10L, 17L, 14L, 3L, 6L, 11L, 7L, 9L, 16L, 13L, 2L, 2L, 11L, 
                                                                                8L, 15L, 14L, 1L, 5L, 11L, 12L), .Label = c("2017-05-05T13:09:12.381-0700", 
                                                                                                                            "2017-06-07T07:03:11.155-0700", "2017-07-26T11:15:03.074-0700", 
                                                                                                                            "2017-08-01T09:00:44.956-0700", "2017-08-14T13:47:21.612-0700", 
                                                                                                                            "2017-08-14T13:47:30.419-0700", "AA1234567", "Acquire replacement files from XYZ", 
                                                                                                                            "Add measurement ", "Ingest changed file location ", "Open", 
                                                                                                                            "Re-classify \"Generic Assays\" (n=24)", "Sub-task", "Task", 
                                                                                                                            "TEST-1030", "TEST-1192", "TEST-1357", "TEST-1358"), class = "factor")), .Names = c("name", 
                                                                                                                                                                                                                "value"), row.names = c(NA, 26L), class = "data.frame")

                              name                               value
1                       issues.key                           TEST-1358
2     issues.fields.issuetype.name                            Sub-task
3            issues.fields.created        2017-08-01T09:00:44.956-0700
4            issues.fields.updated        2017-08-01T09:00:44.956-0700
5        issues.fields.status.name                                Open
6  issues.fields.customfield_10400                           AA1234567
7            issues.fields.summary       Ingest changed file location 
8                       issues.key                           TEST-1357
9     issues.fields.issuetype.name                                Task
10           issues.fields.created        2017-07-26T11:15:03.074-0700
11           issues.fields.updated        2017-08-14T13:47:30.419-0700
12       issues.fields.status.name                                Open
13 issues.fields.customfield_10400                           AA1234567
14           issues.fields.summary                    Add measurement 
15                      issues.key                           TEST-1192
16    issues.fields.issuetype.name                            Sub-task
17           issues.fields.created        2017-06-07T07:03:11.155-0700
18           issues.fields.updated        2017-06-07T07:03:11.155-0700
19       issues.fields.status.name                                Open
20           issues.fields.summary  Acquire replacement files from XYZ
21                      issues.key                           TEST-1030
22    issues.fields.issuetype.name                                Task
23           issues.fields.created        2017-05-05T13:09:12.381-0700
24           issues.fields.updated        2017-08-14T13:47:21.612-0700
25       issues.fields.status.name                                Open
26           issues.fields.summary Re-classify "Generic Assays" (n=24)

When I unstack the list, I get the following error message.

Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,  : 
  arguments imply differing number of rows:

Can someone suggest how to handle this situation?

I need to create a data frame like the one shown below.

res = structure(list(issues.fields.created = structure(c(4L, 3L, 2L, 
                                                   1L), .Label = c("2017-05-05T13:09:12.381-0700", "2017-06-07T07:03:11.155-0700", 
                                                                   "2017-07-26T11:15:03.074-0700", "2017-08-01T09:00:44.956-0700"
                                                   ), class = "factor"), issues.fields.issuetype.name = structure(c(1L, 
                                                                                                                    2L, 1L, 2L), .Label = c("Sub-task", "Task"), class = "factor"), 
               issues.fields.status.name = structure(c(1L, 1L, 1L, 1L), .Label = "Open", class = "factor"), 
               issues.fields.customfield_10400 = structure(c(2L, 2L, 1L, 
                                                             1L), .Label = c("", "AA1234567"), class = "factor"), issues.fields.summary = structure(c(3L, 
                                                                                                                                                      2L, 1L, 4L), .Label = c("Acquire replacement files from XYZ", 
                                                                                                                                                                              "Add measurement ", "Ingest changed file location", "Re-classify \"Generic Assays\" (n=24)"
                                                                                                                                                      ), class = "factor"), issues.fields.updated = structure(c(2L, 
                                                                                                                                                                                                                4L, 1L, 3L), .Label = c("2017-06-07T07:03:11.155-0700", "2017-08-01T09:00:44.956-0700", 
                                                                                                                                                                                                                                        "2017-08-14T13:47:21.612-0700", "2017-08-14T13:47:30.419-0700"
                                                                                                                                                                                                                ), class = "factor"), issues.key = structure(c(4L, 3L, 2L, 
                                                                                                                                                                                                                                                               1L), .Label = c("TEST-1030", "TEST-1192", "TEST-1357", "TEST-1358"
                                                                                                                                                                                                                                                               ), class = "factor")), .Names = c("issues.fields.created", 
                                                                                                                                                                                                                                                                                                 "issues.fields.issuetype.name", "issues.fields.status.name", 
                                                                                                                                                                                                                                                                                                 "issues.fields.customfield_10400", "issues.fields.summary", "issues.fields.updated", 
                                                                                                                                                                                                                                                                                                 "issues.key"), row.names = c(NA, 4L), class = "data.frame")

         issues.fields.created issues.fields.issuetype.name issues.fields.status.name
1 2017-08-01T09:00:44.956-0700                     Sub-task                      Open
2 2017-07-26T11:15:03.074-0700                         Task                      Open
3 2017-06-07T07:03:11.155-0700                     Sub-task                      Open
4 2017-05-05T13:09:12.381-0700                         Task                      Open
  issues.fields.customfield_10400               issues.fields.summary
1                       AA1234567        Ingest changed file location
2                       AA1234567                    Add measurement 
3                                  Acquire replacement files from XYZ
4                                 Re-classify "Generic Assays" (n=24)
         issues.fields.updated issues.key
1 2017-08-01T09:00:44.956-0700  TEST-1358
2 2017-08-14T13:47:30.419-0700  TEST-1357
3 2017-06-07T07:03:11.155-0700  TEST-1192
4 2017-08-14T13:47:21.612-0700  TEST-1030

【Comments】:

    Tags: r


    【Solution 1】:

    Using the unstack function mentioned in the title:

    us = unstack(DF, value ~ name)
    data.frame(lapply(us, `length<-`, max(lengths(us))))
    

    This gives

             issues.fields.created issues.fields.customfield_10400 issues.fields.issuetype.name issues.fields.status.name
    1 2017-08-01T09:00:44.956-0700                       AA1234567                     Sub-task                      Open
    2 2017-07-26T11:15:03.074-0700                       AA1234567                         Task                      Open
    3 2017-06-07T07:03:11.155-0700                            <NA>                     Sub-task                      Open
    4 2017-05-05T13:09:12.381-0700                            <NA>                         Task                      Open
                    issues.fields.summary        issues.fields.updated issues.key
    1       Ingest changed file location  2017-08-01T09:00:44.956-0700  TEST-1358
    2                    Add measurement  2017-08-14T13:47:30.419-0700  TEST-1357
    3  Acquire replacement files from XYZ 2017-06-07T07:03:11.155-0700  TEST-1192
    4 Re-classify "Generic Assays" (n=24) 2017-08-14T13:47:21.612-0700  TEST-1030
    

    Missing values are filled with NA, the standard missing-value code in R, rather than with blanks.
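
One caveat worth checking, offered as a sketch rather than a verdict on the answer: `length<-` pads each short column at the end, so the rows line up with the right records only when the records lacking the field come last, as they do in the question's data. The toy data frame below is hypothetical (the names `toy`, `key`, `type`, and `extra` are not from the question) and puts the record without the optional field first.

```r
# Hypothetical toy data: record "A-1" has no "extra" field, record "A-2" does.
toy <- data.frame(
  name  = c("key", "type", "key", "type", "extra"),
  value = c("A-1", "Task", "A-2", "Bug", "X9"),
  stringsAsFactors = FALSE
)

# Same two steps as in the answer: unstack, then pad short columns with NA.
us <- unstack(toy, value ~ name)
padded <- data.frame(lapply(us, `length<-`, max(lengths(us))),
                     stringsAsFactors = FALSE)

# The padding goes at the end of the short column, so "X9" (which belongs
# to A-2) ends up in row 1, next to A-1.
print(padded)
```

If the missing entries can occur anywhere in the list, the answers below that build a record id before reshaping keep each value on its own record's row.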

    【Comments】:

      【Solution 2】:
      #Split `DF` by `name` into a list. Keep only the second column for each subgroup
      mylist = lapply(split(DF, DF$name), function(a) as.character(a[,2]))
      
      #Obtain the length of the subgroup in the list with most elements
      temp = max(lengths(mylist))
      
      #Subset all groups to the `temp`. `sapply` will simplify into matrix
      output = as.data.frame(sapply(mylist, function(a) a[1:temp]))
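
The padding in the last step relies on a base R indexing rule: asking a vector for positions past its end returns NA for those positions. A minimal sketch with a hypothetical list (`groups`, `key`, and `extra` are illustrative names, not from the question):

```r
# Hypothetical groups of unequal length, like the result of split() above.
groups <- list(key = c("A-1", "A-2", "A-3"), extra = c("X9"))

# Length of the longest group.
n <- max(lengths(groups))

# Indexing 1..n past the end of a shorter vector yields NA, so every
# group comes back with exactly n elements.
padded <- sapply(groups, function(a) a[seq_len(n)])

# sapply() simplifies the equal-length vectors into a character matrix,
# one column per group, which converts cleanly to a data frame.
out <- as.data.frame(padded, stringsAsFactors = FALSE)
print(out)
```

As with any end padding, the NA values are appended at the bottom of the shorter columns.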
      

      【Comments】:

        【Solution 3】:

        This is just a change from "long" format to "wide" format. Using dplyr and tidyr...

        library(dplyr)
        library(tidyr)
        # DF is the data frame from the question; each "issues.key" row starts a new case
        df2 <- DF %>% mutate(case=cumsum(name=="issues.key")) %>% 
                      spread(key=name, value=value) %>%
                      select(-case)
        
        df2
                 issues.fields.created issues.fields.customfield_10400 issues.fields.issuetype.name issues.fields.status.name               issues.fields.summary        issues.fields.updated issues.key
        1 2017-08-01T09:00:44.956-0700                       AA1234567                     Sub-task                      Open       Ingest changed file location  2017-08-01T09:00:44.956-0700  TEST-1358
        2 2017-07-26T11:15:03.074-0700                       AA1234567                         Task                      Open                    Add measurement  2017-08-14T13:47:30.419-0700  TEST-1357
        3 2017-06-07T07:03:11.155-0700                            <NA>                     Sub-task                      Open  Acquire replacement files from XYZ 2017-06-07T07:03:11.155-0700  TEST-1192
        4 2017-05-05T13:09:12.381-0700                            <NA>                         Task                      Open Re-classify "Generic Assays" (n=24) 2017-08-14T13:47:21.612-0700  TEST-1030
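
The step that makes this reshape work is the `case` column: `cumsum(name == "issues.key")` increases by one at every `issues.key` row, so all rows up to the next key share one record number. A small base R sketch of that step alone, using a shortened, hypothetical `name` vector:

```r
# Shortened, hypothetical sequence of field names; the second record
# carries the optional custom field, the first one does not.
name <- c("issues.key", "issues.fields.summary",
          "issues.key", "issues.fields.customfield_10400",
          "issues.fields.summary")

# TRUE counts as 1, so the running sum steps up at each "issues.key".
case <- cumsum(name == "issues.key")
print(case)

# Grouping by the record number shows which fields each record has.
by_record <- split(name, case)
print(by_record)
```

This assumes every record begins with an `issues.key` row, which holds for the data shown in the question.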
        

        【Comments】:

          【Solution 4】:

          Using the dcast function from data.table (or reshape2), you can do the following:

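          # dat is assumed to be the data frame from the question (DF)
          dat <- DF
          # create ID variable: each "TEST-" key value starts a new record
          dat$id <- cumsum(grepl("TEST-", dat$value, fixed=TRUE))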
          

          Now reshape, with id against name

          library(data.table) # or library(reshape2)
          dcast(dat, id~name, value.var="value", fill=NA)
          

          This returns the desired result below.

            id        issues.fields.created issues.fields.customfield_10400 issues.fields.issuetype.name
          1  1 2017-08-01T09:00:44.956-0700                       AA1234567                     Sub-task
          2  2 2017-07-26T11:15:03.074-0700                       AA1234567                         Task
          3  3 2017-06-07T07:03:11.155-0700                            <NA>                     Sub-task
          4  4 2017-05-05T13:09:12.381-0700                            <NA>                         Task
            issues.fields.status.name               issues.fields.summary        issues.fields.updated issues.key
          1                      Open       Ingest changed file location  2017-08-01T09:00:44.956-0700  TEST-1358
          2                      Open                    Add measurement  2017-08-14T13:47:30.419-0700  TEST-1357
          3                      Open  Acquire replacement files from XYZ 2017-06-07T07:03:11.155-0700  TEST-1192
          4                      Open Re-classify "Generic Assays" (n=24) 2017-08-14T13:47:21.612-0700  TEST-1030
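
For readers without data.table or reshape2, the same id-based reshape can probably be done with `reshape()` from base R. This is a sketch on hypothetical toy data (`toy`, `key`, `type`, and `extra` are illustrative names), not part of the original answer:

```r
# Hypothetical long-format data: the second record lacks the "extra" field.
toy <- data.frame(
  name  = c("key", "type", "extra", "key", "type"),
  value = c("A-1", "Task", "X9", "A-2", "Bug"),
  stringsAsFactors = FALSE
)

# Record id: a new record starts at every "key" row.
toy$id <- cumsum(toy$name == "key")

# Long to wide: one row per id, one column per distinct name.
# Combinations of id and name that are absent become NA.
wide <- reshape(toy, idvar = "id", timevar = "name", direction = "wide")

# reshape() prefixes the new columns with "value."; strip the prefix.
names(wide) <- sub("^value\\.", "", names(wide))
print(wide)
```

Because rows are matched on `id`, the NA lands in the record that actually lacks the field, wherever that record sits in the list.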
          

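          【Comments】:

          • Surprisingly, this seems to work: dcast(dat, cumsum(name == "issues.key") ~ name)
          • That's nice. I haven't seen a complex formula used in one of these functions before. I suppose that if such expressions work in an lm call, they probably should work in this context too.
          • Imo's strategy works for the given example. However, for a long list with thousands of entries, Frank's strategy worked smoothly.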