【Question title】: Reshaping multiple groups of columns in a data frame from wide to long
【Posted】: 2017-08-08 07:08:51
【Question】:

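I am working with air quality data and tried to use the melt function to reshape a data frame from wide to long. Here is the data: Elev stands for elevation, Obs for observation, US3, DK1 and DE1 are models, and lm and ul are the first and third quantiles.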

  Elev      Obs       lm       ul      US3       lm       ul      DK1       lm       ul
1    0 37.74289 34.33422 41.27840 38.82037 35.35241 42.30042 49.31111 45.00134 53.90968
2  100 38.14076 34.71842 41.36560 39.82727 36.49086 43.22209 50.46545 45.79068 55.44664
3  250 39.31056 35.98180 42.50011 40.94909 37.70768 44.40232 50.79818 45.76405 55.54795
4  500 41.03098 37.78005 44.02544 42.54909 39.25627 45.72927 51.24182 46.76091 55.88568
5  750 43.57307 40.52575 46.92804 43.48000 40.55918 46.62914 51.90364 47.40586 56.37514
       DE1       lm       ul
1 41.15185 37.81824 44.62509
2 40.89455 37.38491 44.34759
3 40.93455 37.33400 44.32573
4 41.26727 37.90150 44.68568
5 43.04545 40.04541 46.12386

I used

 melt(f, id.vars = c("Elev", "lm", "ul"), measure.vars = c("US3", "DK1", "DE1", "Obs"))

and got

Elev       lm       ul      variable    value
   0 34.33422 41.27840           US3 38.82037
 100 34.71842 41.36560           US3 39.82727
 250 35.98180 42.50011           US3 40.94909
 500 37.78005 44.02544           US3 42.54909
 750 40.52575 46.92804           US3 43.48000
   0 34.33422 41.27840           DK1 49.31111
 100 34.71842 41.36560           DK1 50.46545
 250 35.98180 42.50011           DK1 50.79818
 500 37.78005 44.02544           DK1 51.24182
 750 40.52575 46.92804           DK1 51.90364
   0 34.33422 41.27840           DE1 41.15185
 100 34.71842 41.36560           DE1 40.89455
 250 35.98180 42.50011           DE1 40.93455
 500 37.78005 44.02544           DE1 41.26727
 750 40.52575 46.92804           DE1 43.04545
   0 34.33422 41.27840           Obs 37.74289
 100 34.71842 41.36560           Obs 38.14076
 250 35.98180 42.50011           Obs 39.31056
 500 37.78005 44.02544           Obs 41.03098
 750 40.52575 46.92804           Obs 43.57307

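Clearly, the same lm and ul values (those belonging to Obs) are repeated for every model at each elevation. How can I get the long format so that each row carries the lm and ul of its own group instead? My expected result is: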

Elev    lm      ul      variable  value
  0 35.35241 42.30042      US3 38.82037
100 36.49086 43.22209      US3 39.82727
250 37.70768 44.40232      US3 40.94909
500 39.25627 45.72927      US3 42.54909
750 40.55918 46.62914      US3 43.48000
  0 45.00134 53.90968      DK1 49.31111
100 45.79068 55.44664      DK1 50.46545
250 45.76405 55.54795      DK1 50.79818
500 46.76091 55.88568      DK1 51.24182
750 47.40586 56.37514      DK1 51.90364
  0 37.81824 44.62509      DE1 41.15185
100 37.38491 44.34759      DE1 40.89455
250 37.33400 44.32573      DE1 40.93455
500 37.90150 44.68568      DE1 41.26727
750 40.04541 46.12386      DE1 43.04545
  0 34.33422 41.27840      Obs 37.74289
100 34.71842 41.36560      Obs 38.14076
250 35.98180 42.50011      Obs 39.31056
500 37.78005 44.02544      Obs 41.03098
750 40.52575 46.92804      Obs 43.57307

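【Comments】:

  • Can you give an example of the format you want to see?
  • What do you want the output to look like? Do you want to discard those variables? Include them only once and have NA below? Something else?
  • You specified lm and ul as id variables (id.vars). You have 20 unique data points and 5 IDs. By definition, they have to repeat.
  • That is how "wide to long" works. If you don't want to see Elev, lm and ul, you can remove them before melting, e.g. with dplyr::select. If you need them, don't worry: the next step in your program will deal with it.
  • library(tidyverse); map(0:3, ~df[c(1, .x * 3 + 2:4)]) %>% map_df(~gather(.x, var, val, -Elev, -lm, -ul))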

Tags: r dataframe reshape reshape2 melt


【Solution 1】:

If you use data.table and name your columns as follows: Elev, Obs_va, Obs_lm, Obs_ul, US3_va, US3_lm, US3_ul, DK1_va, DK1_lm, DK1_ul, DE1_va, DE1_lm, DE1_ul.

Then this code reshapes the lm and ul columns in a fairly general way.

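library(data.table)
temp <- melt(temp, id.vars = "Elev")  # long: one row per Elev/column pair
# split names such as "US3_lm" into group ("US3") and measure ("lm")
temp[, `:=`(var = sub("_..$", "", variable),
            measure = sub(".*_", "", variable), variable = NULL)]
# spread lm/ul back into columns; drop the filter to keep "va" as well
dcast(temp[measure != "va"], ... ~ measure, value.var = "value")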

You can also pass the arguments by hand, or simply split the data.table or data.frame into blocks manually and stack them.
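
The manual split-and-stack approach mentioned above could look roughly like the following base R sketch. It uses only the first two rows and the first two groups of the question's data, and it assumes that every group occupies three consecutive columns in the order value, lm, ul. The names value_cols, blocks and long are chosen here for illustration only.

```r
# First two rows of the question's data (Obs and US3 groups only);
# check.names = FALSE keeps the duplicated "lm"/"ul" column names
DF <- data.frame(
  Elev = c(0L, 100L),
  Obs = c(37.74289, 38.14076), lm = c(34.33422, 34.71842), ul = c(41.27840, 41.36560),
  US3 = c(38.82037, 39.82727), lm = c(35.35241, 36.49086), ul = c(42.30042, 43.22209),
  check.names = FALSE
)

# Positions of each group's value column; its lm and ul follow directly
# (assumption: groups are laid out as value, lm, ul)
value_cols <- seq(2, ncol(DF), by = 3)

# Build one block per group, addressing columns by position so that the
# duplicated names cause no ambiguity
blocks <- lapply(value_cols, function(i) {
  data.frame(
    Elev = DF$Elev,
    lm = DF[[i + 1]],
    ul = DF[[i + 2]],
    variable = names(DF)[i],
    value = DF[[i]]
  )
})

# Stack the blocks into one long data frame
long <- do.call(rbind, blocks)
long
```

Addressing the columns by position avoids having to rename the duplicated lm and ul columns first, at the cost of depending on the column order.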

Here is another, simpler solution:

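# melt the lm and ul columns of all groups at once; variable is the group index
temp2 <- melt(temp, measure.vars = patterns("lm$", "ul$"),
   value.name = c("lm", "ul"))[, c("Elev", "variable", "lm", "ul")]
# replace the group index by the group name taken from the "_va" columns
temp2[, "variable"] <- sub("_va", "", grep("_va", names(temp),
   value = TRUE))[temp2$variable]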

where temp is your original (wide) data.table.
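
This answer assumes the columns already follow the group_measure naming scheme. A minimal base R sketch of deriving those names from the question's layout is shown below, on a two-row, two-group subset of the data. It assumes each group occupies three consecutive columns in the order value, lm, ul; groups and new_names are illustrative names, not part of the original answer.

```r
# Subset of the question's data with its duplicated "lm"/"ul" names
DF <- data.frame(
  Elev = c(0L, 100L),
  Obs = c(37.74289, 38.14076), lm = c(34.33422, 34.71842), ul = c(41.27840, 41.36560),
  US3 = c(38.82037, 39.82727), lm = c(35.35241, 36.49086), ul = c(42.30042, 43.22209),
  check.names = FALSE
)

# Group names sit at every third column, starting at column 2
# (assumption: groups are laid out as value, lm, ul)
groups <- names(DF)[seq(2, ncol(DF), by = 3)]

# Repeat each group name three times and append the measure suffix
new_names <- c("Elev",
               paste(rep(groups, each = 3), c("va", "lm", "ul"), sep = "_"))

names(DF) <- new_names
names(DF)
```

After renaming, the data frame can be converted with as.data.table(DF) from data.table to obtain the temp object used above.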

【Comments】:

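    【Solution 2】:

    Recent versions of data.table allow you to melt multiple columns simultaneously.

    Another difficulty is that the data frame contains columns with identical names. Thanks to the patterns() function, the columns do not need to be renamed beforehand.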

    library(data.table) # version 1.10.4 used here
    
    # create vector of the names of data groups - in the order they appear in the DF !
    dg_names <- c("Obs", "US3", "DK1", "DE1")
    
    # coerce DF to data.table and melt using the patterns() function to identify columns
    molten <- melt(setDT(DF), 
                   measure.vars = patterns(paste(dg_names, collapse = "|"), "lm", "ul"), 
                   value.name = c("value", "lm", "ul"))
    
    # rename variable column to something meaningful
    molten[, variable := factor(variable, labels = dg_names)]
    

    The result matches what the OP expects, although the order of columns and rows differs:

    molten
    #    Elev variable    value       lm       ul
    # 1:    0      Obs 37.74289 34.33422 41.27840
    # 2:  100      Obs 38.14076 34.71842 41.36560
    # 3:  250      Obs 39.31056 35.98180 42.50011
    # 4:  500      Obs 41.03098 37.78005 44.02544
    # 5:  750      Obs 43.57307 40.52575 46.92804
    # 6:    0      US3 38.82037 35.35241 42.30042
    # 7:  100      US3 39.82727 36.49086 43.22209
    # 8:  250      US3 40.94909 37.70768 44.40232
    # 9:  500      US3 42.54909 39.25627 45.72927
    #10:  750      US3 43.48000 40.55918 46.62914
    #11:    0      DK1 49.31111 45.00134 53.90968
    #12:  100      DK1 50.46545 45.79068 55.44664
    #13:  250      DK1 50.79818 45.76405 55.54795
    #14:  500      DK1 51.24182 46.76091 55.88568
    #15:  750      DK1 51.90364 47.40586 56.37514
    #16:    0      DE1 41.15185 37.81824 44.62509
    #17:  100      DE1 40.89455 37.38491 44.34759
    #18:  250      DE1 40.93455 37.33400 44.32573
    #19:  500      DE1 41.26727 37.90150 44.68568
    #20:  750      DE1 43.04545 40.04541 46.12386
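
If the exact column and row order of the expected result matters, the molten table could be rearranged afterwards. The following is a sketch on a two-row, two-group subset of the data, assuming the data.table package is installed; it repeats the melt step so that it runs on its own and then uses setcolorder() and setorder() to reorder by reference.

```r
library(data.table)

# Subset of the question's data with its duplicated "lm"/"ul" names
DF <- data.frame(
  Elev = c(0L, 100L),
  Obs = c(37.74289, 38.14076), lm = c(34.33422, 34.71842), ul = c(41.27840, 41.36560),
  US3 = c(38.82037, 39.82727), lm = c(35.35241, 36.49086), ul = c(42.30042, 43.22209),
  check.names = FALSE
)
dg_names <- c("Obs", "US3")

# Same melt as in the answer, restricted to the two groups present here
molten <- melt(as.data.table(DF),
               measure.vars = patterns(paste(dg_names, collapse = "|"), "lm", "ul"),
               value.name = c("value", "lm", "ul"))
molten[, variable := factor(variable, labels = dg_names)]

# Put the models first and Obs last, as in the expected result
molten[, variable := factor(variable, levels = c("US3", "Obs"))]

# Reorder columns and rows by reference
setcolorder(molten, c("Elev", "lm", "ul", "variable", "value"))
setorder(molten, variable, Elev)
molten
```

Sorting on a factor follows the order of its levels, which is why the levels are reset before calling setorder().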
    

    Data

    DF <- structure(list(Elev = c(0L, 100L, 250L, 500L, 750L), Obs = c(37.74289, 
    38.14076, 39.31056, 41.03098, 43.57307), lm = c(34.33422, 34.71842, 
    35.9818, 37.78005, 40.52575), ul = c(41.2784, 41.3656, 42.50011, 
    44.02544, 46.92804), US3 = c(38.82037, 39.82727, 40.94909, 42.54909, 
    43.48), lm = c(35.35241, 36.49086, 37.70768, 39.25627, 40.55918
    ), ul = c(42.30042, 43.22209, 44.40232, 45.72927, 46.62914), 
        DK1 = c(49.31111, 50.46545, 50.79818, 51.24182, 51.90364), 
        lm = c(45.00134, 45.79068, 45.76405, 46.76091, 47.40586), 
        ul = c(53.90968, 55.44664, 55.54795, 55.88568, 56.37514), 
        DE1 = c(41.15185, 40.89455, 40.93455, 41.26727, 43.04545), 
        lm = c(37.81824, 37.38491, 37.334, 37.9015, 40.04541), ul = c(44.62509, 
        44.34759, 44.32573, 44.68568, 46.12386)), .Names = c("Elev", 
    "Obs", "lm", "ul", "US3", "lm", "ul", "DK1", "lm", "ul", "DE1", 
    "lm", "ul"), row.names = c(NA, -5L), class = "data.frame")
    

    【Comments】:

    • @G1124E, could you please give some feedback on whether skan's answer and mine meet your requirements, or whether something is missing? Thank you.