【Question title】: Combining columns, update columns based on other df, fill NAs
【Posted】: 2018-02-27 20:01:57
【Question】:

At the outset I'd like to point out that I found several solutions on SO, but none of them did what I expected.

I have two DFs:

1.

E                           F              G        H
chr1_100203723_100203724    NA             NA       NA
chr1_100212951_100212952    rs760764323    A,G,     0.000008,0.999992,
chr1_10032235_10032236      NA             NA       NA
chr1_100327060_100327061    NA             NA       NA
chr1_100346889_100346890    NA             NA       NA
chr1_100347237_100347238    rs749372877    C,G,T,   0.000008,0.000008,0.999983,
chr1_100357190_100357191    NA             NA       NA
chr1_100358057_100358058    NA             NA       NA
chr2_182852606_182852607    NA             NA       NA
chr2_202492077_202492078    NA             NA       NA
chr2_203760838_203760839    NA             NA       NA
chr2_215976351_215976352    NA             NA       NA
chr2_220354644_220354645    NA             NA       NA
chr2_234749403_234749404    NA             NA       NA
chr2_11802110_11802111      NA             NA       NA
chr2_31167747_31167748      NA             NA       NA

2.

E                           F               G       H
chr1_100203723_100203724    NA              NA      NA
chr1_100212951_100212952    NA              NA      NA
chr1_10032235_10032236      NA              NA      NA
chr1_100327060_100327061    NA              NA      NA
chr1_100346889_100346890    NA              NA      NA
chr1_100347237_100347238    NA              NA      NA
chr1_100357190_100357191    NA              NA      NA
chr1_100358057_100358058    NA              NA      NA
chr2_182852606_182852607    rs773426830     C,T,    0.999967,0.000033,
chr2_202492077_202492078    rs750583431     C,G,    0.000013,0.999987,
chr2_203760838_203760839    NA              NA      NA
chr2_215976351_215976352    rs113648834     C,T,    0.999934,0.000066,
chr2_220354644_220354645    NA              NA      NA
chr2_234749403_234749404    NA              NA      NA
chr2_11802110_11802111      rs371327070     A,G,    0.000044,0.999956,
chr2_31167747_31167748      rs201375957     A,C,T,  0.000008,0.999887,0.000105,

Desired output:

E                           F               G       H
chr1_100203723_100203724    NA              NA      NA
chr1_100212951_100212952    rs760764323     A,G,    0.000008,0.999992,
chr1_10032235_10032236      NA              NA      NA
chr1_100327060_100327061    NA              NA      NA
chr1_100346889_100346890    NA              NA      NA
chr1_100347237_100347238    rs749372877     C,G,T,  0.000008,0.000008,0.999983,
chr1_100357190_100357191    NA              NA      NA
chr1_100358057_100358058    NA              NA      NA
chr2_182852606_182852607    rs773426830     C,T,    0.999967,0.000033,
chr2_202492077_202492078    rs750583431     C,G,    0.000013,0.999987,
chr2_203760838_203760839    NA              NA      NA
chr2_215976351_215976352    rs113648834     C,T,    0.999934,0.000066,
chr2_220354644_220354645    NA              NA      NA
chr2_234749403_234749404    NA              NA      NA
chr2_11802110_11802111      rs371327070     A,G,    0.000044,0.999956,
chr2_31167747_31167748      rs201375957     A,C,T,  0.000008,0.999887,0.000105,

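As you can see, DF1 is updated with columns F, G and H from DF2, where column E is my unique index. I tried merge(), but that function does not update my rows; it only appends DF2's columns to DF1. I also tried updating with data.table and tidyverse: my rows were updated, but the other rows turned into NAs... In the end I decided to do a simple lapply() with a nested ifelse(), but I don't know how to update all three columns at once, not to mention that this is far too slow for my data, which has more than 50000 rows in each DF...

What I have done so far: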

DF1$F <- sapply(seq_len(nrow(DF1)), function(i) ifelse(DF1[i, "E"] == DF2[i, "E"] & is.na(DF1[i, "F"]), DF2[i, "F"], DF1[i, "F"]))

【Question comments】:

    Tags: r merge dplyr


    【Solution 1】:

    You can do this in base R:

    as.data.frame(Map(function(x,y) ifelse(is.na(x),y,x),DF1,DF2))
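
The same position-wise fill can also be sketched with logical matrix indexing, which handles every column in one assignment. This is only a sketch on made-up toy data: `a`, `b` and `filled` are hypothetical names standing in for DF1, DF2 and the result, and it assumes both frames have identical dimensions, column order and row order, and that all columns are character (as in the question's data).

```r
# Toy stand-ins for DF1 and DF2 (hypothetical data, same shape and row order).
a <- data.frame(E = c("k1", "k2", "k3"),
                F = c("rs1", NA, NA),
                G = c("A,G,", NA, NA),
                stringsAsFactors = FALSE)
b <- data.frame(E = c("k1", "k2", "k3"),
                F = c(NA, "rs2", NA),
                G = c(NA, "C,T,", NA),
                stringsAsFactors = FALSE)

filled <- a
gaps <- is.na(filled)      # logical matrix marking the NA cells of `a`
filled[gaps] <- b[gaps]    # copy the cells at the same positions from `b`
filled
```

Cells that are NA in both inputs stay NA, and non-NA cells of `a` are never overwritten.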
    

    With the purrr library you can get a prettier, more compact form (see Sotos' answer for an even more compact one using dplyr):

    library(purrr)
    map2_df(DF1,DF2,~ifelse(is.na(.x),.y,.x))
    

    In both cases (technically, the first gives a data.frame and the second a tibble):

    Output

                                 E           F      G                           H
    1  chr1_100203723_100203724        <NA>   <NA>                        <NA>
    2  chr1_100212951_100212952 rs760764323   A,G,          0.000008,0.999992,
    3    chr1_10032235_10032236        <NA>   <NA>                        <NA>
    4  chr1_100327060_100327061        <NA>   <NA>                        <NA>
    5  chr1_100346889_100346890        <NA>   <NA>                        <NA>
    6  chr1_100347237_100347238 rs749372877 C,G,T, 0.000008,0.000008,0.999983,
    7  chr1_100357190_100357191        <NA>   <NA>                        <NA>
    8  chr1_100358057_100358058        <NA>   <NA>                        <NA>
    9  chr2_182852606_182852607 rs773426830   C,T,          0.999967,0.000033,
    10 chr2_202492077_202492078 rs750583431   C,G,          0.000013,0.999987,
    11 chr2_203760838_203760839        <NA>   <NA>                        <NA>
    12 chr2_215976351_215976352 rs113648834   C,T,          0.999934,0.000066,
    13 chr2_220354644_220354645        <NA>   <NA>                        <NA>
    14 chr2_234749403_234749404        <NA>   <NA>                        <NA>
    15   chr2_11802110_11802111 rs371327070   A,G,          0.000044,0.999956,
    16   chr2_31167747_31167748 rs201375957 A,C,T, 0.000008,0.999887,0.000105,
    

    Data

    DF1 <- read.table(text="E                           F              G        H
    chr1_100203723_100203724    NA             NA       NA
    chr1_100212951_100212952    rs760764323    A,G,     0.000008,0.999992,
    chr1_10032235_10032236      NA             NA       NA
    chr1_100327060_100327061    NA             NA       NA
    chr1_100346889_100346890    NA             NA       NA
    chr1_100347237_100347238    rs749372877    C,G,T,   0.000008,0.000008,0.999983,
    chr1_100357190_100357191    NA             NA       NA
    chr1_100358057_100358058    NA             NA       NA
    chr2_182852606_182852607    NA             NA       NA
    chr2_202492077_202492078    NA             NA       NA
    chr2_203760838_203760839    NA             NA       NA
    chr2_215976351_215976352    NA             NA       NA
    chr2_220354644_220354645    NA             NA       NA
    chr2_234749403_234749404    NA             NA       NA
    chr2_11802110_11802111      NA             NA       NA
    chr2_31167747_31167748      NA             NA       NA",header=T,stringsAsFactors=F)
    
    
    DF2 <- read.table(text="E                           F               G       H
    chr1_100203723_100203724    NA              NA      NA
    chr1_100212951_100212952    NA              NA      NA
    chr1_10032235_10032236      NA              NA      NA
    chr1_100327060_100327061    NA              NA      NA
    chr1_100346889_100346890    NA              NA      NA
    chr1_100347237_100347238    NA              NA      NA
    chr1_100357190_100357191    NA              NA      NA
    chr1_100358057_100358058    NA              NA      NA
    chr2_182852606_182852607    rs773426830     C,T,    0.999967,0.000033,
    chr2_202492077_202492078    rs750583431     C,G,    0.000013,0.999987,
    chr2_203760838_203760839    NA              NA      NA
    chr2_215976351_215976352    rs113648834     C,T,    0.999934,0.000066,
    chr2_220354644_220354645    NA              NA      NA
    chr2_234749403_234749404    NA              NA      NA
    chr2_11802110_11802111      rs371327070     A,G,    0.000044,0.999956,
    chr2_31167747_31167748      rs201375957     A,C,T,  0.000008,0.999887,0.000105,",header=T,stringsAsFactors=F)
    

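    【Comments】:

    • I've always wondered what the difference between Map() and mapply() is. Could you explain?
    • In ?Map you can read: "Map is a simple wrapper to mapply which does not attempt to simplify the result". In my example, mapply would return a matrix, whereas Map returns a list, which I then convert to a data.frame.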
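
The difference described in the comment above can be seen on a small example. This is a minimal sketch with made-up inputs; `x`, `y` and `fill` are hypothetical names, not part of the answer.

```r
# Two named lists of equal-length numeric vectors (hypothetical toy data).
x <- list(a = c(1, NA, 3), b = c(NA, 5, 6))
y <- list(a = c(9, 2, 9), b = c(4, 9, 9))

# Take the value from p unless it is NA, in which case take it from q.
fill <- function(p, q) ifelse(is.na(p), q, p)

as_list   <- Map(fill, x, y)      # always a list, one element per input pair
as_matrix <- mapply(fill, x, y)   # simplified to a 3 x 2 numeric matrix here

str(as_list)
str(as_matrix)
```

Calling `mapply(fill, x, y, SIMPLIFY = FALSE)` gives the same list as `Map(fill, x, y)`. The list form is usually the safer input for `as.data.frame()`, because each column keeps its own type instead of being coerced into one matrix type.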
    【Solution 2】:

    The coalesce function from dplyr does exactly this. I'm sure we could use a purrr function to map over the 2 data frames, but here is one using base R mapply:

    DF1[-1] <- mapply(dplyr::coalesce, DF1[-1], DF2[-1])
    

    which gives,

                             E           F      G                           H
    1  chr1_100203723_100203724        <NA>   <NA>                        <NA>
    2  chr1_100212951_100212952 rs760764323   A,G,          0.000008,0.999992,
    3    chr1_10032235_10032236        <NA>   <NA>                        <NA>
    4  chr1_100327060_100327061        <NA>   <NA>                        <NA>
    5  chr1_100346889_100346890        <NA>   <NA>                        <NA>
    6  chr1_100347237_100347238 rs749372877 C,G,T, 0.000008,0.000008,0.999983,
    7  chr1_100357190_100357191        <NA>   <NA>                        <NA>
    8  chr1_100358057_100358058        <NA>   <NA>                        <NA>
    9  chr2_182852606_182852607 rs773426830   C,T,          0.999967,0.000033,
    10 chr2_202492077_202492078 rs750583431   C,G,          0.000013,0.999987,
    11 chr2_203760838_203760839        <NA>   <NA>                        <NA>
    12 chr2_215976351_215976352 rs113648834   C,T,          0.999934,0.000066,
    13 chr2_220354644_220354645        <NA>   <NA>                        <NA>
    14 chr2_234749403_234749404        <NA>   <NA>                        <NA>
    15   chr2_11802110_11802111 rs371327070   A,G,          0.000044,0.999956,
    16   chr2_31167747_31167748 rs201375957 A,C,T, 0.000008,0.999887,0.000105,
    

    Note: as @Moody_Mudskipper mentioned, the purrr version produces a new data frame without altering DF1 or DF2,

    library(purrr)
    
    map2_df(DF1,DF2,dplyr::coalesce)
    

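    【Comments】:

    • Neat; with purrr it would just be map2_df(DF1,DF2,coalesce), and you avoid the trick needed to get a data.frame and can keep DF1 intact
    • @Moody_Mudskipper Ah... thanks. I was doing map2_df(DF1, DF2, ~coalesce)
    • map2_df(DF1, DF2, ~coalesce(.x,.y)) then :)
    • @Moody_Mudskipper Heh, yes, I just figured it out :)
    • As you can see, there are 2 chromosomes (a simple part of the data), and I am combining them all together. In other words, I combined df1 and df2, and next I want to add df3 and df4 to df1 (whose rows have already been updated from df2). Can I do that without introducing any errors into the data? Also, taking all the data sets together, column E has no duplicates. I'm only asking because I'm worried that some rows might get unwanted updates.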
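
On the concern raised in the last comment: the position-wise approaches in this thread pair row 1 with row 1, row 2 with row 2, and so on, so they are only safe when every data frame has the same rows in the same order. One way to guard against unwanted updates is to align rows on column E first. The following is a sketch under that assumption, not part of either answer: `fill_from`, `d1`, `d2`, `d3` and `combined` are hypothetical names, and the data is made up.

```r
# Fill NA cells of `target` from `source`, matching rows on the key column
# instead of relying on row position. `fill_from` is a hypothetical helper.
fill_from <- function(target, source, key = "E") {
  rows <- match(target[[key]], source[[key]])   # NA where the key is absent
  cols <- setdiff(intersect(names(target), names(source)), key)
  for (col in cols) {
    candidate <- source[[col]][rows]
    gaps <- is.na(target[[col]])
    target[[col]][gaps] <- candidate[gaps]      # existing values are never overwritten
  }
  target
}

d1 <- data.frame(E = c("k1", "k2", "k3"), F = c("rs1", NA, NA),
                 stringsAsFactors = FALSE)
d2 <- data.frame(E = c("k3", "k1", "k2"), F = c(NA, NA, "rs2"),
                 stringsAsFactors = FALSE)   # same keys, different row order
d3 <- data.frame(E = c("k3", "k9"), F = c("rs3", "rs9"),
                 stringsAsFactors = FALSE)   # partial overlap; k9 is not in d1

# Apply the sources one after another, starting from d1.
combined <- Reduce(fill_from, list(d2, d3), d1)
combined
```

Because only NA cells are filled, a value taken from an earlier data frame is kept even if a later one holds a different value for the same key. Keys that appear only in a source (like `k9` here) are ignored rather than appended.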
    【Solution 3】:

    Another naive approach is to use paste:

    > df1 <- data.frame(E = c('A','B','C'), F=c('0.9,1',NA,NA), G=c(NA,'0.98,0.34',NA), H=c(NA,'0.98,0.34',NA), stringsAsFactors = F)
    > df2 <- data.frame(E = c('A','B','C'), F=c(NA,'1,3',NA), G=c(NA,NA,'5,6,7'), H=c(NA,NA,NA), stringsAsFactors = F)
    > df1[is.na(df1)] <- ''
    > df2[is.na(df2)] <- ''
    > mapply(paste, df1[-1], df2[-1])
         F        G            H           
    [1,] "0.9,1 " " "          " "         
    [2,] " 1,3"   "0.98,0.34 " "0.98,0.34 "
    [3,] " "      " 5,6,7"     " "         
    

    Updated to use mapply, following Sotos' suggestion
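
The pasted matrix above still carries a padding space in every cell and uses empty strings where the question expects NA. A possible cleanup is sketched below; it rebuilds the same toy `df1` and `df2`, and `pasted`, `cleaned` and `result` are hypothetical names introduced only for this illustration.

```r
df1 <- data.frame(E = c("A", "B", "C"), F = c("0.9,1", NA, NA),
                  G = c(NA, "0.98,0.34", NA), H = c(NA, "0.98,0.34", NA),
                  stringsAsFactors = FALSE)
df2 <- data.frame(E = c("A", "B", "C"), F = c(NA, "1,3", NA),
                  G = c(NA, NA, "5,6,7"), H = c(NA, NA, NA),
                  stringsAsFactors = FALSE)

df1[is.na(df1)] <- ""
df2[is.na(df2)] <- ""

pasted <- mapply(paste, df1[-1], df2[-1])   # character matrix with padding spaces

cleaned <- pasted
cleaned[] <- trimws(pasted)                 # strip the padding, keep the matrix shape
cleaned[cleaned == ""] <- NA                # cells empty in both inputs become NA again

result <- data.frame(E = df1$E, cleaned, stringsAsFactors = FALSE)
result
```

Note that, unlike the ifelse and coalesce solutions, paste joins both values with a space when a cell is filled in both data frames, so this only matches the desired output when at most one side has a value.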

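    【Comments】:

    • Isn't this merging categorical data?
    • If you have 100 columns, this gets scary
    • Yes, thanks, I get it. Would it be better to delete the answer?
    • You could do something like mapply(paste, df1[-1], df2[-1])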