【Question title】: How to select 1 of several variables with the same prefix?
【Posted】: 2019-04-03 14:32:15
【Question description】:

Following up on my previous question, "How do I return multiple columns without consider Na values and group by other columns name in R?"

Mexico_01 <- c(1,2,5,1,NA,1)
Mexico_02 <- c(3,NA,2,0,4,1)
Argentina_01 <- c(2,1,5,2,NA,2)
Argentina_02 <- c(2,3,NA,2,2,2)
Italy<- c(NA,10,10,10,NA,10)
Spain_01 <- c(2,NA,4,6,8,11)
Spain_02 <- c(3,4,NA,11,11,11)
England <- c(5,NA,10,NA,NA,12)
Germany <- c(1,NA,NA,NA,NA,10)
Data_Risk = data.frame( Mexico_01, Mexico_02, Argentina_01, Argentina_02, 
Italy, Spain_01, Spain_02, England, Germany)

library(data.table)
library(magrittr)
Data_Risk <- as.data.table(Data_Risk)
# (row, col) positions of every non-NA cell
all_variable <- as.data.table(which(!is.na(Data_Risk), arr.ind = TRUE))
# per row, label the non-NA columns var1, var2, ... and reshape to wide
all_variable[, .(colnm = names(Data_Risk)[col],
                 col = paste0('var', order(col))), by = row] %>%
  dcast(row ~ col, value.var = 'colnm')

which gives

   row      var1         var2         var3         var4     var5     var6     var7    var8    var9
1:   1 Mexico_01    Mexico_02 Argentina_01 Argentina_02 Spain_01 Spain_02  England Germany    <NA>
2:   2 Mexico_01 Argentina_01 Argentina_02        Italy Spain_02     <NA>     <NA>    <NA>    <NA>
3:   3 Mexico_01    Mexico_02 Argentina_01        Italy Spain_01  England     <NA>    <NA>    <NA>
4:   4 Mexico_01    Mexico_02 Argentina_01 Argentina_02    Italy Spain_01 Spain_02    <NA>    <NA>
5:   5 Mexico_02 Argentina_02     Spain_01     Spain_02     <NA>     <NA>     <NA>    <NA>    <NA>
6:   6 Mexico_01    Mexico_02 Argentina_01 Argentina_02    Italy Spain_01 Spain_02 England Germany

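In this case, I need to keep only a single variable out of all those that share the same prefix; for example, instead of mexico_01 or mexico_02, select just mexico.

So the final table should look like this: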

var1           var2          var3       var4     var5    var6
mexico    argentina       england    germany     null    null
mexico    argentina         italy       null     null    null 
mexico    argentina         italy      spain  england    null
mexico    argentina         italy      spain     null    null
spain      null             null       null      null    null
mexico    argentina         italy      spain england  germany

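【Question discussion】:

    Tags: r join


    【Solution 1】:

    We can split the column with tstrsplit, get the indices of the elements duplicated by the 'row' and 'V1' columns, assign NA to those elements of 'V1', and then do the dcast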

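    # split e.g. "Mexico_01" into prefix (V1) and suffix (V2)
    out[, c("V1", "V2") := tstrsplit(colnm, "_")]
    # row indices where the same prefix repeats within a 'row'
    i1 <- out[, .I[duplicated(.SD)], .SDcols = c('row',  'V1')]
    out[i1, V1 := NA_character_]
    # within each row, move the NA entries to the end
    out[, V1 := V1[order(is.na(V1))], row]
    dcast(out, row ~ col, value.var = "V1")[, row := NULL][]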
    

    Data

    out <-  all_variable [, .(colnm = names(Data_Risk)[col], 
             col = paste0('var',  order(col))) , by = row]
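
For readers who want to run the whole thing in one go, here is a minimal end-to-end sketch. It is not the answer's exact code: it assumes the suffix is always an underscore followed by digits, strips it with `sub()`, and uses `unique()` instead of the `duplicated()`/NA reordering above. The names `long`, `country`, `var` and `wide` are made up for illustration.

```r
library(data.table)

# Same sample data as in the question
Data_Risk <- data.table(
  Mexico_01    = c(1, 2, 5, 1, NA, 1),
  Mexico_02    = c(3, NA, 2, 0, 4, 1),
  Argentina_01 = c(2, 1, 5, 2, NA, 2),
  Argentina_02 = c(2, 3, NA, 2, 2, 2),
  Italy        = c(NA, 10, 10, 10, NA, 10),
  Spain_01     = c(2, NA, 4, 6, 8, 11),
  Spain_02     = c(3, 4, NA, 11, 11, 11),
  England      = c(5, NA, 10, NA, NA, 12),
  Germany      = c(1, NA, NA, NA, NA, 10)
)

# (row, col) positions of every non-NA cell
long <- as.data.table(which(!is.na(Data_Risk), arr.ind = TRUE))

# Drop a trailing "_<digits>" suffix so Mexico_01 and Mexico_02 both become "Mexico"
# (assumption: suffixes always have this shape)
long[, country := sub("_[0-9]+$", "", names(Data_Risk)[col])]

# Keep one entry per (row, country), in original column order
long <- unique(long[order(row, col)], by = c("row", "country"))

# Number the surviving entries within each row and reshape to wide
long[, var := paste0("var", seq_len(.N)), by = row]
wide <- dcast(long, row ~ var, value.var = "country")
print(wide)
```

The result keeps the original capitalisation (Mexico rather than mexico); wrapping the `sub()` call in `tolower()` would change that. Rows with fewer distinct countries are padded with NA on the right.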
    

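    【Discussion】:

    • Thanks for your support, but do you know how to consider each variable only once? I mean, so that it does not repeat the variable, as in the final table I showed above.
    • @Lectxx7 Can you elaborate on "does not repeat the variable"? In your final table, I find england, 'germany', etc. repeated
    • I mean, for example, consider each variable only once per row; first row:
    • @Lectxx7 I got it, but are you sure the expected output is based on the example? I find that some values do not match
    • Thank you very much. With out[, c("V1", "V2") := tstrsplit(colnm, "_")] as you put it above, we don't need to consider the last 2 digits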
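
On the last point, if the numeric suffix is never needed, a small sketch of one option is to ask `tstrsplit` for the first piece only via its `keep` argument, so no `V2` column is created. The toy table `dt` is made up for illustration.

```r
library(data.table)

# Toy column of names, with and without a numeric suffix
dt <- data.table(colnm = c("Mexico_01", "Mexico_02", "Italy", "Spain_02"))

# keep = 1L returns only the part before the first "_"
dt[, V1 := tstrsplit(colnm, "_", fixed = TRUE, keep = 1L)]
print(dt)
```

Names without an underscore, such as Italy, pass through unchanged.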