Using pivot_wider or a similar function in R with repeated-measurement data

[Question title]: Using pivot_wider or similar function with R with repeat measurement data
[Posted]: 2021-04-14 22:38:04
[Question description]:

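I have a data frame of patients with one row per chest X-ray. My columns include a measurement specific to that chest X-ray, the date of the chest X-ray, and several additional columns that are the same for a given patient (such as the final outcome).

For example: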

+--------+------------+----------+------------+-------------+-----+-------+---------+
| pat_id | index_date | cxr_date | delta_date | cxr_measure | age | admit | outcome |
+--------+------------+----------+------------+-------------+-----+-------+---------+
|      1 | 1/2/2020   | 1/2/2020 |          0 |         0.1 |  55 |     1 |       0 |
|      1 | 1/2/2020   | 1/3/2020 |          1 |         0.3 |  55 |     1 |       0 |
|      1 | 1/2/2020   | 1/3/2020 |          1 |         0.5 |  55 |     1 |       0 |
|      2 | 2/1/2020   | 2/2/2020 |          1 |         0.2 |  59 |     0 |       0 |
|      2 | 2/1/2020   | 2/3/2020 |          2 |         0.9 |  59 |     0 |       0 |
|      3 | 1/6/2020   | 1/6/2020 |          0 |         0.7 |  66 |     1 |       1 |
+--------+------------+----------+------------+-------------+-----+-------+---------+

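I would like to reformat the table so that there is one row per patient. I think my final table should look like the one below, with each measurement becoming cxr_measure_#, where # is the delta_date. In the real dataset I will have many such columns (# ranges from -5 to +30). If there are two rows/values on the same delta_date, ideally I would take the mean.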

+--------+------------+----------------+---------------+---------------+---------------+-----+-------+---------+
| pat_id | index_date | first_cxr_date | cxr_measure_0 | cxr_measure_1 | cxr_measure_2 | age | admit | outcome |
+--------+------------+----------------+---------------+---------------+---------------+-----+-------+---------+
|      1 | 1/2/2020   | 1/2/2020       | 0.1           | 0.4           | NA            |  55 |     1 |       0 |
|      2 | 2/1/2020   | 2/2/2020       | NA            | 0.2           | 0.9           |  59 |     0 |       0 |
|      3 | 1/6/2020   | 1/6/2020       | 0.7           | NA            | NA            |  66 |     1 |       1 |
+--------+------------+----------------+---------------+---------------+---------------+-----+-------+---------+

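Is there an easy way to do this basic reshaping between the two tables? I have played around with pivot_longer and pivot_wider, but I am not sure how to (1) get delta_date into the variable names and (2) take the mean when two rows fall on the same date. I am also curious whether this would be easier in Python (most of my data management is done with pandas, followed by some additional data cleaning and analysis in R).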

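[Question discussion]:

You should share a reproducible snippet of your data with dput(head(data)) so that others can use it to help you more effectively.
@AnoushiravanR The data shared here is fully reproducible. For example, copy the first table to the clipboard and then run read.table(text=readClipboard(), sep="|", fill = TRUE, comment.char = "+", head = TRUE, strip.white = TRUE)
@Onyambu Sorry, that was my mistake. I had never used this code to read a table like this before. I really appreciate this valuable lesson. Thank you very much.
I will remember to use dput(head(data)) in the future if it makes things easier (or include Onyambu's function for reference!). My actual data is much messier, so I created a representative dataset using a site that converts tab-delimited data or an Excel copy/paste into the ASCII format above.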

Tags: r pivot-table reshape

[Solution 1]:

To expand on @Dave2e's answer, you can use group_by and then min to get first_cxr_date for each pat_id, which lets you write a concise functional solution.

library(tibble)
library(dplyr)
library(tidyr)

df <- 
tribble( 
~pat_id,  ~index_date,  ~cxr_date,  ~delta_date,  ~cxr_measure,  ~age,  ~admit,  ~outcome, 
        1,  '1/2/2020',  '1/2/2020',          0,          0.1,   55,      1,        0, 
        1,  '1/2/2020',   '1/3/2020',           1,          0.3,   55,      1,        0, 
        1,  '1/2/2020',  '1/3/2020',          1,          0.5,   55,      1,        0, 
        2,  '2/1/2020',   '2/2/2020',           1,          0.2,   59,      0,        0, 
        2,  '2/1/2020',  '2/3/2020',          2,          0.9,   59,      0,        0, 
        3,  '1/6/2020',   '1/6/2020',           0,          0.7,   66,      1,        1)

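df %>% 
  # set first_cxr_date to the earliest cxr_date within each pat_id
  # (cxr_date is character here, so min() compares strings; on real data
  #  parse it first, e.g. as.Date(cxr_date, "%m/%d/%Y"))
  group_by(pat_id) %>% mutate(first_cxr_date = min(cxr_date)) %>% ungroup() %>%
  # the names_from/values_from columns are never id columns, so only cxr_date is excluded
  pivot_wider(id_cols = -cxr_date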
              , names_from = delta_date # column names from delta_date
              , values_from = cxr_measure
              , names_prefix = 'cxr_measure_' # paste string to column names
              , values_fn = mean # combine with mean
              )

# A tibble: 3 x 9
  pat_id index_date   age admit outcome first_cxr_date cxr_measure_0 cxr_measure_1 cxr_measure_2
   <dbl> <chr>      <dbl> <dbl>   <dbl> <chr>                  <dbl>         <dbl>         <dbl>
1      1 1/2/2020      55     1       0 1/2/2020                 0.1           0.4          NA  
2      2 2/1/2020      59     0       0 2/2/2020                NA             0.2           0.9
3      3 1/6/2020      66     1       1 1/6/2020                 0.7          NA            NA

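[Discussion]:

Both Dave2e's answer and this one work well. This one is a bit cleaner in format, but otherwise both are great! One thing I had not realized is that my actual data has 30 other columns (some of which vary across rows within a pat_id group), so I had to either modify id_cols to account for the columns that differ within a group, or first use dplyr's select() to keep the columns of interest and then merge the related data back by pat_id.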
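The select-then-merge route mentioned in this comment could look roughly like the sketch below. It is only an illustration under stated assumptions: the `view` column is a hypothetical stand-in for a column that varies within a pat_id group and is not part of the original data, and the data frame is a cut-down copy of the example.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: `view` differs between rows of the same patient,
# so it cannot serve as an id column in the pivot.
df <- tibble::tribble(
  ~pat_id, ~delta_date, ~cxr_measure, ~age, ~view,
        1,           0,          0.1,   55,  "PA",
        1,           1,          0.3,   55,  "AP",
        1,           1,          0.5,   55,  "PA",
        2,           1,          0.2,   59,  "AP",
        2,           2,          0.9,   59,  "AP"
)

# Step 1: keep only the columns the reshape needs, then pivot.
wide <- df %>%
  select(pat_id, delta_date, cxr_measure) %>%
  pivot_wider(
    id_cols      = pat_id,
    names_from   = delta_date,
    values_from  = cxr_measure,
    names_prefix = "cxr_measure_",
    values_fn    = mean            # average duplicate delta_date values
  )

# Step 2: one row per patient for the columns that are constant per patient.
per_patient <- df %>%
  distinct(pat_id, age)

# Step 3: merge the two pieces back together by pat_id.
result <- left_join(per_patient, wide, by = "pat_id")
result
```

Columns that vary within a patient (here `view`) are simply left out of both pieces; if they matter, they would need their own summary before the join.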

[Solution 2]:

This is a hybrid approach that uses pivot_wider to compute the mean of the cxr_measure values and dplyr's summarize function to determine the first cxr_date.

df<- structure(list(pat_id = c(1L, 1L, 1L, 2L, 2L, 3L), 
                    index_date = c("1/2/2020",  "1/2/2020", "1/2/2020", "2/1/2020", "2/1/2020", "1/6/2020"), 
                    cxr_date = c("1/2/2020", "1/3/2020", "1/3/2020", "2/2/2020",  "2/3/2020", "1/6/2020"), 
                    delta_date = c(0L, 1L, 1L, 1L, 2L, 0L), 
                    cxr_measure = c(0.1, 0.3, 0.5, 0.2, 0.9, 0.7), 
                    age = c(55L,55L, 55L, 59L, 59L, 66L), 
                    admit = c(1L, 1L, 1L, 0L, 0L, 1L), 
                    outcome = c(0L, 0L, 0L, 0L, 0L, 1L)), class = "data.frame", row.names = c(NA, -6L))

library(tidyr)
library(dplyr)

answer <- pivot_wider(df, id_cols = -cxr_date,  # names_from/values_from columns are never id columns
            names_from = "delta_date", 
            values_from = c("cxr_measure"),
            values_fn = list(cxr_measure = mean),
            names_glue ='cxr_measure_{delta_date}') 

 firstdate <-df %>% group_by(pat_id) %>% summarize(first_cxr_date=min(as.Date(cxr_date, "%m/%d/%Y")))
 
answer <- left_join(answer, firstdate)
Joining, by = "pat_id"
# A tibble: 3 x 9
  pat_id index_date   age admit outcome cxr_measure_0 cxr_measure_1 cxr_measure_2 first_cxr_date
   <int>       <chr>   <int> <int>   <int>         <dbl>         <dbl>         <dbl>    <date>        
1      1    1/2/2020      55     1       0           0.1           0.4          NA   2020-01-02    
2      2    2/1/2020      59     0       0          NA             0.2           0.9 2020-02-02    
3      3    1/6/2020      66     1       1           0.7          NA            NA   2020-01-06

I'm sure there is a way to combine all of this into a single function call, but sometimes ugly is just faster.
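One possible way to fold the two steps into a single pipeline is sketched below. It is not the answerer's code, just an illustration that assumes the dates are always in month/day/year form: the first date is computed with a grouped mutate on parsed dates, and cxr_date is dropped before pivoting so that no id_cols argument is needed.

```r
library(dplyr)
library(tidyr)

# Same example data as in the question.
df <- data.frame(
  pat_id      = c(1L, 1L, 1L, 2L, 2L, 3L),
  index_date  = c("1/2/2020", "1/2/2020", "1/2/2020", "2/1/2020", "2/1/2020", "1/6/2020"),
  cxr_date    = c("1/2/2020", "1/3/2020", "1/3/2020", "2/2/2020", "2/3/2020", "1/6/2020"),
  delta_date  = c(0L, 1L, 1L, 1L, 2L, 0L),
  cxr_measure = c(0.1, 0.3, 0.5, 0.2, 0.9, 0.7),
  age         = c(55L, 55L, 55L, 59L, 59L, 66L),
  admit       = c(1L, 1L, 1L, 0L, 0L, 1L),
  outcome     = c(0L, 0L, 0L, 0L, 0L, 1L)
)

answer <- df %>%
  group_by(pat_id) %>%
  # parse to Date first so min() compares dates, not strings
  mutate(first_cxr_date = min(as.Date(cxr_date, "%m/%d/%Y"))) %>%
  ungroup() %>%
  select(-cxr_date) %>%            # varies within a patient, so drop it before pivoting
  pivot_wider(
    names_from   = delta_date,
    values_from  = cxr_measure,
    names_prefix = "cxr_measure_",
    values_fn    = mean            # average duplicate delta_date values
  )
answer
```

Parsing the dates matters on real data: compared as strings, "1/10/2020" sorts before "1/2/2020", so a string minimum can pick the wrong X-ray.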

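[Discussion]:

[Solution 3]:

Special thanks to dear Mr. @Onyambu, who taught me a valuable point today.

You can also use the following solution. Note .value, which is especially useful with pivot_longer when several column names need to be created from the data. Here it tells pivot_wider that part of the name is actually the name of the column the values are taken from.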

library(dplyr)
library(tidyr)


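df %>%
  group_by(pat_id) %>%
  mutate(id = row_number()) %>%   # row id keeps duplicate delta_date values on separate rows
  pivot_wider(names_from = delta_date, values_from = cxr_measure, 
              names_glue = "{.value}_{delta_date}") %>%
  # average each measure column within the patient; an all-NA column gives NaN
  mutate(across(starts_with("cxr_measure_"), ~ mean(.x, na.rm = TRUE))) %>%
  select(-id) %>%
  slice_head(n = 1)               # keep one row per patient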


# A tibble: 3 x 9
# Groups:   pat_id [3]
  pat_id index_date cxr_date   age admit outcome cxr_measure_0 cxr_measure_1 cxr_measure_2
   <int> <chr>      <chr>    <int> <int>   <int>         <dbl>         <dbl>         <dbl>
1      1 1/2/2020   1/2/2020    55     1       0           0.1           0.4         NaN  
2      2 2/1/2020   2/2/2020    59     0       0         NaN             0.2           0.9
3      3 1/6/2020   1/6/2020    66     1       1           0.7         NaN           NaN
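What `.value` contributes is easier to see when more than one value column is pivoted at once. The sketch below assumes a second measurement column, `cxr_score`, which is hypothetical and not part of the original data. It also averages duplicates through values_fn rather than after the pivot, so combinations that never occur stay NA instead of becoming NaN.

```r
library(dplyr)
library(tidyr)

# Hypothetical data with two measurement columns per X-ray.
df <- tibble::tribble(
  ~pat_id, ~delta_date, ~cxr_measure, ~cxr_score,
        1,           0,          0.1,         10,
        1,           1,          0.3,         20,
        1,           1,          0.5,         40,
        2,           1,          0.2,         30,
        2,           2,          0.9,         50
)

wide <- df %>%
  pivot_wider(
    id_cols     = pat_id,
    names_from  = delta_date,
    values_from = c(cxr_measure, cxr_score),
    # {.value} is replaced by the name of each values_from column
    names_glue  = "{.value}_{delta_date}",
    values_fn   = mean             # average duplicate delta_date values
  )
names(wide)
```

With a single values_from column, `{.value}` always expands to the same name, so writing the prefix out literally gives the same result.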

[Discussion]: