【Question title】: R: Return column names for all matching values in a row
【Posted】: 2017-03-29 15:16:16
【Question】:

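I have a data frame (hits_map) that lists genes (rows) against binding sites (columns). Each value is the number of times that site occurs in the gene, with NA standing for 0.

Here is a small subset, as the actual data frame is much bigger: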

         AscG Dan.4 IclR.3 MraZ.1
afaE      NA     1     NA      1
afaF      NA    NA     NA     NA
agn43.1    1    NA      1     NA
agn43.2    1    NA     NA     NA
agn43.3    1    NA     NA     NA
chuA      NA    NA     NA      1
csgA       1    NA     NA      1
csgB      NA    NA     NA     NA
csgC      NA    NA     NA     NA

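For each row, I want to get a list of the binding sites (column names) that contain a value, which I can then use to extract rows from a corresponding data frame, nameseq, to get further information.

At present I do this one row at a time with the code below, using a function remove_zero_cols to drop the columns whose value is 0, but I would like to be able to do it for every row by passing in the whole data frame.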

vec <- hits_map[row,]
vec <- remove_zero_cols(vec)
vec <- colnames(vec)
nameseq[nameseq$Name %in% vec,]
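
To make the per-row step above runnable, here is a minimal sketch. The question does not show remove_zero_cols, so the definition below is a hypothetical stand-in that assumes it keeps only the columns of a one-row data frame whose value is neither NA nor 0. The two rows of hits_map are copied from the subset shown earlier; the final nameseq lookup is left out because nameseq is not shown.

```r
# Hypothetical stand-in for the question's remove_zero_cols(): keep only
# the columns of a one-row data frame whose value is neither NA nor 0.
remove_zero_cols <- function(df) {
  vals <- unlist(df[1, ])
  keep <- !is.na(vals) & vals != 0
  df[, keep, drop = FALSE]
}

# Two rows taken from the subset shown in the question.
hits_map <- data.frame(
  AscG   = c(NA_real_, 1),
  Dan.4  = c(1, NA_real_),
  IclR.3 = c(NA_real_, NA_real_),
  MraZ.1 = c(1, 1),
  row.names = c("afaE", "csgA")
)

row <- "afaE"
vec <- hits_map[row, ]
vec <- remove_zero_cols(vec)
vec <- colnames(vec)
vec
```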

Any suggestions on how to approach this?

【Comments】:

  • Your question is off-topic here but on-topic on Stack Overflow. I have voted to migrate it there; there is no need to re-ask.

Tags: r


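【Solution 1】:

One approach is to flatten the data frame, row by row, into a single vector and build a logical vector from the value you are looking for, making sure FALSE is converted to NA. Then create a repeated vector of column names with the same length as the logical vector, subset it, and convert the result back to a matrix: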

> set.seed(1)
> DF = data.frame(first = sample(c(NA,1), 5, T), second = sample(c(NA,1), 5, T),
+                 third = sample(c(NA,1), 5, T), fourth = sample(c(NA,1), 5, T),
+                 fifth = sample(c(NA,1), 5, T))
> DF
  first second third fourth fifth
1    NA      1    NA     NA     1
2    NA      1    NA      1    NA
3     1      1     1      1     1
4     1      1    NA     NA    NA
5    NA     NA     1      1    NA
> DFvector = as.vector(t(DF))
> DFvector
 [1] NA  1 NA NA  1 NA  1 NA  1 NA  1  1  1  1  1  1  1 NA NA NA NA NA  1  1 NA
# Create a repeated vector of column names
> columnNames = rep(colnames(DF), times = nrow(DF))
> myNames = columnNames[as.logical(DFvector)]
> myNames[is.na(myNames)] = ""
> myNames
 [1] ""       "second" ""       ""       "fifth"  ""       "second" ""       "fourth" ""       "first" 
[12] "second" "third"  "fourth" "fifth"  "first"  "second" ""       ""       ""       ""       ""      
[23] "third"  "fourth" ""      
# Convert to matrix, by row
> myMatrix = matrix(myNames, ncol = ncol(DF), byrow = T)
# Can group per row, by using assertr package
> library(assertr)
> library(stringr)
> concat = assertr::col_concat(myMatrix[], sep = " ")
> concat
[1] " second   fifth"                 " second  fourth "                "first second third fourth fifth"
[4] "first second   "                 "  third fourth "                
> noWS = trimws(concat)
> noWS
[1] "second   fifth"                  "second  fourth"                  "first second third fourth fifth"
[4] "first second"                    "third fourth"                   
> noS = gsub(pattern = "\\s+", replacement = " ", x = noWS)
> noS
[1] "second fifth"                    "second fourth"                   "first second third fourth fifth"
[4] "first second"                    "third fourth"                   
> stringr::str_split(noS, " ", simplify = T)
     [,1]     [,2]     [,3]    [,4]     [,5]   
[1,] "second" "fifth"  ""      ""       ""     
[2,] "second" "fourth" ""      ""       ""     
[3,] "first"  "second" "third" "fourth" "fifth"
[4,] "first"  "second" ""      ""       ""     
[5,] "third"  "fourth" ""      ""       "" 

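Each row of this matrix now lines up with the same row of the original data frame and holds the column names that matched in it. I hope someone can post a data.table/dplyr alternative, because this is pretty tedious if you want to avoid lapply.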
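
For comparison, here is a minimal base R sketch of the same idea that skips the string round trip and does not call lapply. It is not the answerer's method. It assumes every cell is either NA or a positive count, uses which(..., arr.ind = TRUE) to locate the non-NA cells, and uses split() to group the column names by row. hits_map is rebuilt from the subset shown in the question; sites_by_gene is a name chosen here for illustration.

```r
# Rebuild the subset of hits_map shown in the question (NA means 0 sites).
hits_map <- data.frame(
  AscG   = c(NA, NA, 1, 1, 1, NA, 1, NA, NA),
  Dan.4  = c(1, NA, NA, NA, NA, NA, NA, NA, NA),
  IclR.3 = c(NA, NA, 1, NA, NA, NA, NA, NA, NA),
  MraZ.1 = c(1, NA, NA, NA, NA, 1, 1, NA, NA),
  row.names = c("afaE", "afaF", "agn43.1", "agn43.2", "agn43.3",
                "chuA", "csgA", "csgB", "csgC")
)

# Row and column positions of every non-NA cell.
idx <- which(!is.na(hits_map), arr.ind = TRUE)

# Group the matching column names by row. Passing all row names as factor
# levels keeps genes without any hit as empty entries in the result.
sites_by_gene <- split(
  colnames(hits_map)[idx[, "col"]],
  factor(rownames(hits_map)[idx[, "row"]], levels = rownames(hits_map))
)

sites_by_gene[["csgA"]]
```

The result is a named list with one character vector per gene, so an entry such as sites_by_gene[["csgA"]] could be used in place of vec in the question's nameseq[nameseq$Name %in% vec,] lookup.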

【Comments】:
