Extract info from a column based on the value of another column in an R data.frame

【Question title】: extract info from a column based on value from another column in data.frame r
【Posted】: 2018-12-26 13:21:38
【Question】:

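I have a big file (~100k rows and 100 columns), and I want to extract the information from four columns based on another column. There is a column called Caller that tells you which of the .sample columns will contain information other than noSample.

I have tried using if and else if statements, but sometimes two conditions are met at once, and writing out every possible combination takes a lot of effort. I am pretty sure there is a better way to do this.

My real data.frame looks like this:

Edit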

 Df <- data.frame(A = c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1"),
             B= c(10,12,13,14,15,16,17),
             Caller = c("A", "B", "C",  "D", "A,C", "A,B,C", "B,D"),
             A.sample = c("3xd|432", "noSample","noSample","noSample","1234|567|87sd","234|456|897a","noSample"),
             dummy1 = 1:7,
             B.sample = c("noSample", "456|789|asd", "noSample","noSample","noSample","674e|7892|123|432","bgcf|12er|567|zxs3|12ple"),
             dummy2 = 1:7,
             C.sample = c("noSample","noSample", "zxc|vbn|mn","noSample","gfd3|123|456|789","674e|7892|123","noSample" ),
             dummy3 = 1:7,
             D.sample = c("noSample","noSample", "noSample", "poi|uyh|gfrt|562", "noSample", "noSample", "567|zxs3|12ple"), stringsAsFactors=FALSE)

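I would like to extract a vector of samples for each row. This can be stored in a list or in another R object. I will use these samples to match against a data.frame in which each sample is associated with a process.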

  My desired output would be

  >row1
  3xd|432 
  >row2
   456|789|asd
  >row3
  zxc|vbn|mn
  >row4
  poi|uyh|gfrt|562
  >row5
  [1]1234|567|87sd [2]gfd3|123|456|789
  >row6
  [1]234|456|897a [2]674e|7892|123|432  [3]674e|7892|123
  >row7
  [1]bgcf|12er|567|zxs3|12ple  [2]567|zxs3|12ple

My desired output does not include the pipes | between the samples, but I can get rid of them with strsplit.
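For the pipe-splitting step, a minimal sketch: `|` is the alternation operator in a regular expression, so strsplit needs fixed = TRUE (or an escaped pattern such as "\\|") to treat it as a literal character. The input string is the row 5 value of A.sample from the data above; `x` and `ids` are just illustrative names.

```r
# Split one pipe-delimited sample string into its individual sample IDs.
# fixed = TRUE makes strsplit treat "|" as a literal character,
# not as regular-expression alternation.
x <- "1234|567|87sd"
ids <- strsplit(x, "|", fixed = TRUE)[[1]]
print(ids)
```

Without fixed = TRUE, the pattern "|" matches the empty string and the input is broken up into single characters instead.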

Since the data.frame is large, speed is critical.

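【Question comments】:

You seem to be trying to obtain the banded diagonal from your data frame. You might want to format your data as a table/matrix so that this is easier to see.
@TimBiegeleisen, it is not always a perfect diagonal; in some cases every value in an entire sample column can be noSample
What about that format? Try to give us a minimal problem.
Sorry if I don't understand what you mean, but I only want to extract the sample information from the columns with noSample, and that information must somehow be indexed by row
How important is it that the samples are labelled with [1] etc. in the output vector?

Tags: r dataframe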

【Solution 1】:

Here is a possible solution:

Df <- data.frame(A = c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1"),
                 B= c(10,12,13,14,15,16,17),
                 Caller = c("A", "B", "C",  "D", "A,C", "A,B,C", "B,D"),
                 A.sample = c("3xd|432", "noSample","noSample","noSample","1234|567|87sd","234|456|897a","noSample"),
                 B.sample = c("noSample", "456|789|asd", "noSample","noSample","noSample","674e|7892|123|432","bgcf|12er|567|zxs3|12ple"),
                 C.sample = c("noSample","noSample", "zxc|vbn|mn","noSample","gfd3|123|456|789","674e|7892|123","noSample" ),
                 D.sample = c("noSample","noSample", "noSample", "poi|uyh|gfrt|562", "noSample", "noSample", "567|zxs3|12ple"),
                 stringsAsFactors=FALSE)

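# Take the first letter of each column name as its caller key
names <- substr(names(Df), 1, 1)
# Set the keys of the unwanted (non-sample) columns to NA
names[-c(4:ncol(Df))] <- NA

# Build one regular expression per row by replacing each comma with the alternation operator |
reg <- gsub(",", "|", Df$Caller, fixed = TRUE)

# Find the matching column positions per row; lapply always returns a list,
# whereas sapply may simplify the result to a vector or matrix
columns <- lapply(reg, function(x) grep(x, names))

# Extract the desired columns for each row into a list
lapply(seq_along(columns), function(x) Df[x, columns[[x]]])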

I added stringsAsFactors=FALSE to the data frame definition to remove the baggage associated with factor levels.
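As an alternative sketch (a different technique from the code above), the columns can be looked up by name rather than by position, so it does not matter where the .sample columns sit in the frame. It assumes every letter in Caller has a matching `<letter>.sample` column. The three-row Df below is a cut-down toy version of the question's data, and `sample_mat` and `samples` are illustrative names.

```r
# Toy version of the question's data, with dummy columns between the sample columns
Df <- data.frame(
  Caller   = c("A", "A,C", "B,D"),
  A.sample = c("3xd|432", "1234|567|87sd", "noSample"),
  dummy1   = 1:3,
  B.sample = c("noSample", "noSample", "bgcf|12er|567|zxs3|12ple"),
  dummy2   = 1:3,
  C.sample = c("noSample", "gfd3|123|456|789", "noSample"),
  dummy3   = 1:3,
  D.sample = c("noSample", "noSample", "567|zxs3|12ple"),
  stringsAsFactors = FALSE
)

# Select the sample columns by name and convert them to a character matrix once
sample_cols <- grep("\\.sample$", names(Df), value = TRUE)
sample_mat  <- as.matrix(Df[sample_cols])

# Split each Caller entry into its letters, e.g. "A,C" -> c("A", "C")
callers <- strsplit(Df$Caller, ",", fixed = TRUE)

# For each row, pull the values of the columns named by its callers
samples <- lapply(seq_along(callers), function(i) {
  unname(sample_mat[i, paste0(callers[[i]], ".sample")])
})

print(samples[[2]])
```

Each element of `samples` is a plain character vector, whether the row has one caller or several. Converting to a matrix up front is a design choice aimed at the speed concern in the question: indexing a character matrix row by row tends to be much cheaper than indexing a data.frame row by row.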

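【Comments】:

It works great, but in my real data the set of columns (A.sample, B.sample, C.sample, D.sample) is not contiguous; they are at positions c(8,10,12,14), and I don't know how to fix the columns step to get the correct columns, because you used +3 to get the correct index, right?
@user2380782, I made an edit to handle non-contiguous columns; just put c(8, 10, 12, 14) in place of the array in the names[-c(...)]<-NA line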

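【Solution 2】:

Just showing one of the many possible ways to achieve the expected result. Note that I use the same data frame as @Dave2e, i.e. I have added stringsAsFactors=F to the call to data.frame.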

library(tidyverse)
out <- Df %>% rowid_to_column() %>% # adding explicit row IDs
       gather(key, value, -rowid, -A, -B, -Caller) %>% # reshaping the data frame to long format
       filter(value != "noSample") # keeping only cells that hold real samples

The resulting data frame looks like this:

out
   rowid    A  B Caller      key                    value
1      1 chr1 10      A A.sample                  3xd|432
2      5 chr1 15    A,C A.sample            1234|567|87sd
3      6 chr1 16  A,B,C A.sample             234|456|897a
4      2 chr1 12      B B.sample              456|789|asd
5      6 chr1 16  A,B,C B.sample        674e|7892|123|432
6      7 chr1 17    B,D B.sample bgcf|12er|567|zxs3|12ple
7      3 chr1 13      C C.sample               zxc|vbn|mn
8      5 chr1 15    A,C C.sample         gfd3|123|456|789
9      6 chr1 16  A,B,C C.sample            674e|7892|123
10     4 chr1 14      D D.sample         poi|uyh|gfrt|562
11     7 chr1 17    B,D D.sample           567|zxs3|12ple

Now we can simply subset to retrieve the desired results:

out[out$rowid == 1,"value"]
[1] "3xd|432"
out[out$rowid == 5,"value"]
[1] "1234|567|87sd"    "gfd3|123|456|789"
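To get every row's samples in one go instead of subsetting one rowid at a time, base R's split() can group the value column by rowid. A minimal sketch, using a hand-typed subset of the `out` rows shown above so that it runs without tidyverse; `by_row` is an illustrative name.

```r
# A few rows of the long-format `out` shown above (only rowid and value are needed)
out <- data.frame(
  rowid = c(1, 5, 6, 6, 5, 6),
  value = c("3xd|432", "1234|567|87sd", "234|456|897a",
            "674e|7892|123|432", "gfd3|123|456|789", "674e|7892|123"),
  stringsAsFactors = FALSE
)

# Group the sample strings by their original row ID;
# the result is a list named by rowid, one character vector per row
by_row <- split(out$value, out$rowid)

print(by_row[["5"]])
```

The list names are the row IDs as strings, so `by_row[["5"]]` plays the same role as `out[out$rowid == 5, "value"]`.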
