【Question title】: Merging almost identical rows in a data frame in R
【Posted】: 2021-01-25 11:33:47
【Question description】:

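I have a large data frame of clinical data (882 obs. of 154 variables). It contains 441 unique patients, each appearing twice, and the two rows for a patient are identical except for one column. A dummy version of the table looks like this: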

id age gender tumour type treatment
1 76 F colon adeno radiotherapy
1 76 F colon adeno chemotherapy
2 70 M colon adeno radiotherapy
2 70 M colon adeno chemotherapy
3 68 M colon adeno radiotherapy
3 68 M colon adeno chemotherapy

I would like to condense the table into this:

id age gender tumour type treatment_a treatment_b
1 76 F colon adeno radiotherapy chemotherapy
2 70 M colon adeno radiotherapy chemotherapy
3 68 M colon adeno radiotherapy chemotherapy

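I have looked online and tried solutions to similar questions, e.g. sapply, group_by, summarise and distinct, but I can't seem to get the syntax right. I am a complete beginner and this seems like a simple problem. Thanks in advance.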

【Comments】:

Tags: r dataframe data-manipulation


【Solution 1】:

A data.table option using dcast

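library(data.table)

# Tag each row within a patient with a letter suffix (a, b, ...), then spread the tags into columns
dcast(
  setDT(df)[, q := paste0(treatment, "_", head(letters, .N)), by = id:type],
  ... ~ q,
  value.var = "treatment")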

gives

   id age gender tumour  type chemotherapy_b radiotherapy_a
1:  1  76      F  colon adeno   chemotherapy   radiotherapy
2:  2  70      M  colon adeno   chemotherapy   radiotherapy
3:  3  68      M  colon adeno   chemotherapy   radiotherapy
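The output above names the columns after the treatments themselves. If the treatment_a / treatment_b names from the question are wanted, a minimal sketch along the same lines could build the column name from the row's position within each patient instead. The names `dt`, `slot` and `wide` are made up for this illustration, and it assumes no patient has more than 26 rows (the length of `letters`).

```r
library(data.table)

# Same dummy data as in the question
df <- data.frame(
  id = c(1L, 1L, 2L, 2L, 3L, 3L),
  age = c(76L, 76L, 70L, 70L, 68L, 68L),
  gender = c("F", "F", "M", "M", "M", "M"),
  tumour = "colon",
  type = "adeno",
  treatment = rep(c("radiotherapy", "chemotherapy"), 3)
)

# Work on a copy so the original data frame is left untouched
dt <- as.data.table(df)

# 'slot' (hypothetical name) is treatment_a for a patient's first row,
# treatment_b for the second, and so on
dt[, slot := paste0("treatment_", letters[seq_len(.N)]), by = id]

# Every column other than 'slot' and 'treatment' identifies a row of the result
wide <- dcast(dt, ... ~ slot, value.var = "treatment")
print(wide)
```

Because the label depends on row order, which treatment ends up in treatment_a follows the order of the rows within each id.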

Data

> dput(df)
structure(list(id = c(1L, 1L, 2L, 2L, 3L, 3L), age = c(76L, 76L, 
70L, 70L, 68L, 68L), gender = c("F", "F", "M", "M", "M", "M"), 
    tumour = c("colon", "colon", "colon", "colon", "colon", "colon"
    ), type = c("adeno", "adeno", "adeno", "adeno", "adeno", 
    "adeno"), treatment = c("radiotherapy", "chemotherapy", "radiotherapy", 
    "chemotherapy", "radiotherapy", "chemotherapy")), class = "data.frame", row.names = c(NA, 
-6L))

【Comments】:

  • These packages are perfect for what I'm trying to do, not sure why I couldn't find them. Thanks!
【Solution 2】:

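In general, it is easier to help if you add a reproducible example with a few rows of data, e.g. using dput(). In this case, copying from your table works too.

You can try pivot_wider() from the tidyr package. Assuming your data is called df and is a tibble:

We first use pivot_wider() and then rename the columns to get what you are looking for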

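library(dplyr)
library(tidyr)
# Note: type is not listed in id_cols, so it is dropped from the result
df %>% 
  pivot_wider(id_cols = c(id, age, gender, tumour), values_from = treatment, names_from = treatment) %>%
  rename(treatment_a = radiotherapy, treatment_b = chemotherapy)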

# A tibble: 3 x 6
     id   age gender tumour treatment_a   treatment_b  
  <int> <int> <chr>  <chr>  <chr>        <chr>       
1     1    76 F      colon  radiotherapy chemotherapy
2     2    70 M      colon  radiotherapy chemotherapy
3     3    68 M      colon  radiotherapy chemotherapy
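The rename() step above relies on knowing the two treatment names in advance. As a rough sketch, assuming the rows for each patient are already in the wanted order, the column names can instead be derived from the row position within each id, which also keeps the type column. The names `slot` and `wide` are invented for this example.

```r
library(dplyr)
library(tidyr)

# Same dummy data as in the question
df <- data.frame(
  id = c(1L, 1L, 2L, 2L, 3L, 3L),
  age = c(76L, 76L, 70L, 70L, 68L, 68L),
  gender = c("F", "F", "M", "M", "M", "M"),
  tumour = "colon",
  type = "adeno",
  treatment = rep(c("radiotherapy", "chemotherapy"), 3)
)

wide <- df %>%
  group_by(id) %>%
  # 'slot' (hypothetical name): treatment_a for the first row of each patient,
  # treatment_b for the second, ...
  mutate(slot = paste0("treatment_", letters[row_number()])) %>%
  ungroup() %>%
  # id_cols defaults to every column not used in names_from / values_from
  pivot_wider(names_from = slot, values_from = treatment)

print(wide)
```

With 154 variables in the real data, leaving id_cols at its default avoids having to list every identifying column by hand, provided all of those columns really are identical across the two rows of a patient.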

【Comments】:

  • Thanks for the suggestion, I will definitely use dput() when asking questions in future. The answer is perfect too.
【Solution 3】:
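# Read the dummy table from the question
df <- read.table(text = "
id  age gender  tumour  type    treatment
1   76  F   colon   adeno   radiotherapy
1   76  F   colon   adeno   chemotherapy
2   70  M   colon   adeno   radiotherapy
2   70  M   colon   adeno   chemotherapy
3   68  M   colon   adeno   radiotherapy
3   68  M   colon   adeno   chemotherapy", header = TRUE)

# Helper column holding the name of the target column for each row
df$idontknow <- ifelse(df$treatment == "radiotherapy", "treatment_a", "treatment_b")

library(reshape2)
# One row per patient, one column per value of the helper column
dcast(df, id + age + gender + tumour + type ~ idontknow, value.var = "treatment")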

  id age gender tumour  type  treatment_a  treatment_b
1  1  76      F  colon adeno radiotherapy chemotherapy
2  2  70      M  colon adeno radiotherapy chemotherapy
3  3  68      M  colon adeno radiotherapy chemotherapy

【Comments】:
