Merging almost identical rows in a data frame in R (answers)

【Question title】: Merging almost identical rows in a data frame in R
【Posted】: 2021-01-25 11:33:47
【Question】:

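I have a large clinical data frame (882 obs. of 154 variables). It contains 441 unique patients, each appearing twice, and the two rows for a patient are identical except for one column. A dummy version of the table looks like this: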

id	age	gender	tumour	type	treatment
1	76	F	colon	adeno	radiotherapy
1	76	F	colon	adeno	chemotherapy
2	70	M	colon	adeno	radiotherapy
2	70	M	colon	adeno	chemotherapy
3	68	M	colon	adeno	radiotherapy
3	68	M	colon	adeno	chemotherapy

I would like to condense the table into this:

id	age	gender	tumour	type	treatment_a	treatment_b
1	76	F	colon	adeno	radiotherapy	chemotherapy
2	70	M	colon	adeno	radiotherapy	chemotherapy
3	68	M	colon	adeno	radiotherapy	chemotherapy

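I have looked online and tried solutions to similar problems, e.g. sapply, group_by, summarise and distinct, but I can't seem to get the syntax right. I'm a complete novice and this seems like a simple problem. Thanks in advance.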

【Comments】:

You could take a look at dplyr.tidyverse.org/reference/join.html

Tags: r dataframe data-manipulation

【Solution 1】:

A data.table option using dcast

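library(data.table)

# Label each patient's rows with the treatment plus a letter (a, b, ...),
# then spread those labels into columns.
dcast(
  setDT(df)[, q := paste0(treatment, "_", head(letters, .N)), by = id:type],
  ... ~ q,
  value.var = "treatment")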

gives

   id age gender tumour  type chemotherapy_b radiotherapy_a
1:  1  76      F  colon adeno   chemotherapy   radiotherapy
2:  2  70      M  colon adeno   chemotherapy   radiotherapy
3:  3  68      M  colon adeno   chemotherapy   radiotherapy

Data

> dput(df)
structure(list(id = c(1L, 1L, 2L, 2L, 3L, 3L), age = c(76L, 76L, 
70L, 70L, 68L, 68L), gender = c("F", "F", "M", "M", "M", "M"), 
    tumour = c("colon", "colon", "colon", "colon", "colon", "colon"
    ), type = c("adeno", "adeno", "adeno", "adeno", "adeno", 
    "adeno"), treatment = c("radiotherapy", "chemotherapy", "radiotherapy", 
    "chemotherapy", "radiotherapy", "chemotherapy")), class = "data.frame", row.names = c(NA, 
-6L))
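
The dcast call above names the new columns after the treatments themselves (chemotherapy_b, radiotherapy_a). If the treatment_a / treatment_b names from the question are wanted, one possible variant is sketched below. It assumes a patient's first row should go to treatment_a and the second to treatment_b; `dt`, `slot` and `wide` are illustrative names that do not appear in the answer.

```r
library(data.table)

# Toy data matching the question's dummy table
dt <- data.table(
  id        = c(1L, 1L, 2L, 2L, 3L, 3L),
  age       = c(76L, 76L, 70L, 70L, 68L, 68L),
  gender    = c("F", "F", "M", "M", "M", "M"),
  tumour    = "colon",
  type      = "adeno",
  treatment = rep(c("radiotherapy", "chemotherapy"), 3)
)

# `slot` (hypothetical helper column): "treatment_a" for a patient's
# first row, "treatment_b" for the second, based on row order only
dt[, slot := paste0("treatment_", letters[seq_len(.N)]),
   by = .(id, age, gender, tumour, type)]

# One row per patient, one column per slot
wide <- dcast(dt, id + age + gender + tumour + type ~ slot,
              value.var = "treatment")
print(wide)
```

Because the column names come from row position rather than from the treatment values, this only gives consistent columns if the rows are ordered the same way for every patient.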

【Comments】:

These packages are perfect for what I'm trying to do; I don't know why I couldn't find them. Thanks!

【Solution 2】:

In general, it is easier to help if you add a reproducible example with a few rows of data, e.g. using dput(). In this case, copying from your table works too.
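
As a rough illustration of the dput() advice, the sketch below turns a few rows into text that can be pasted into a question and then rebuilds the same object from that text. The small `df` here is made-up data, and `snippet` and `rebuilt` are illustrative names.

```r
# Made-up rows standing in for the real data
df <- data.frame(
  id        = c(1L, 1L, 2L, 2L),
  treatment = c("radiotherapy", "chemotherapy",
                "radiotherapy", "chemotherapy")
)

# Text representation of the first rows, ready to paste into a question
snippet <- capture.output(dput(head(df, 4)))
cat(snippet, sep = "\n")

# Anyone reading the question can rebuild the object from that text
rebuilt <- eval(parse(text = snippet))
```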

You can try pivot_wider() from the tidyr package. Assuming your data is called df and is a tibble:

We first use pivot_wider(), then rename the columns to get what you are looking for.

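library(dplyr)
library(tidyr)

# `type` is not listed in id_cols, so it is dropped from the result
df %>%
  pivot_wider(id_cols = c(id, age, gender, tumour), values_from = treatment, names_from = treatment) %>%
  rename(treatment_a = radiotherapy, treatment_b = chemotherapy)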

# A tibble: 3 x 6
     id   age gender tumour treatment_a   treatment_b  
  <int> <int> <chr>  <chr>  <chr>        <chr>       
1     1    76 F      colon  radiotherapy chemotherapy
2     2    70 M      colon  radiotherapy chemotherapy
3     3    68 M      colon  radiotherapy chemotherapy
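
The rename() step above relies on knowing the two treatment values in advance, and id_cols has to list every column to keep. With 154 variables that may be tedious, so here is a hedged variant that numbers each patient's rows and lets pivot_wider() keep all remaining columns as identifiers. It assumes rows are grouped by `id` and that row order decides which treatment is "a" and which is "b"; `slot` and `wide` are illustrative names.

```r
library(dplyr)
library(tidyr)

# Toy data matching the question's dummy table
df <- tibble(
  id        = c(1L, 1L, 2L, 2L, 3L, 3L),
  age       = c(76L, 76L, 70L, 70L, 68L, 68L),
  gender    = c("F", "F", "M", "M", "M", "M"),
  tumour    = "colon",
  type      = "adeno",
  treatment = rep(c("radiotherapy", "chemotherapy"), 3)
)

wide <- df %>%
  group_by(id) %>%
  mutate(slot = letters[row_number()]) %>%  # "a" for first row, "b" for second
  ungroup() %>%
  # id_cols is left at its default: every column except slot and treatment
  pivot_wider(names_from = slot, values_from = treatment,
              names_prefix = "treatment_")
print(wide)
```

This only collapses to one row per patient if every other column really is identical across the patient's two rows; any column that differs would keep the rows apart.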

【Comments】:

Thanks for the suggestion; I'll be sure to use dput() when asking questions in future. The answer is perfect too.

【Solution 3】:

# Read the question's dummy table
df <- read.table(text = "
id  age gender  tumour  type    treatment
1   76  F   colon   adeno   radiotherapy
1   76  F   colon   adeno   chemotherapy
2   70  M   colon   adeno   radiotherapy
2   70  M   colon   adeno   chemotherapy
3   68  M   colon   adeno   radiotherapy
3   68  M   colon   adeno   chemotherapy", header = TRUE)

# Helper column holding the target column name for each row
df$idontknow <- ifelse(df$treatment == "radiotherapy", "treatment_a", "treatment_b")

library(reshape2)
dcast(df, id + age + gender + tumour + type ~ idontknow, value.var = "treatment")

  id age gender tumour  type  treatment_a  treatment_b
1  1  76      F  colon adeno radiotherapy chemotherapy
2  2  70      M  colon adeno radiotherapy chemotherapy
3  3  68      M  colon adeno radiotherapy chemotherapy
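
The ifelse() step above hard-codes "radiotherapy". If the real data holds other treatment values, a position-based helper column may be easier to manage. The sketch below is not the reshape2 approach from this answer: it swaps in base R's reshape() so that no extra package is needed. It assumes the first row per `id` should become treatment_a and the second treatment_b; `slot` and `wide` are illustrative names.

```r
# Toy data matching the question's dummy table
df <- data.frame(
  id        = c(1L, 1L, 2L, 2L, 3L, 3L),
  age       = c(76L, 76L, 70L, 70L, 68L, 68L),
  gender    = c("F", "F", "M", "M", "M", "M"),
  tumour    = "colon",
  type      = "adeno",
  treatment = rep(c("radiotherapy", "chemotherapy"), 3)
)

# Position of each row within its patient (1, 2, ...) mapped to a, b, ...
df$slot <- letters[ave(seq_along(df$id), df$id, FUN = seq_along)]

# Long to wide: one treatment_<slot> column per position
wide <- reshape(df,
                idvar     = c("id", "age", "gender", "tumour", "type"),
                timevar   = "slot",
                direction = "wide",
                sep       = "_")
print(wide)
```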
