【Question Title】: Trying to merge multiple tsv files in R based on two common columns
【Posted】: 2020-05-29 09:52:45
【Question Description】:

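I have many different tsv files, each with the same column names and the same number of rows. I would like to merge the files, but based on two specific columns, as illustrated below (the first rows of one tsv file):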

  #chrom chromStart  chromEnd           gene_id score strand         name exonic_length num_reads num_reads_fw
1      1  100250296 100250441 ENSG00000201491.1     .      -     RNU4-75P           146         0            0
2      1  100257218 100257309 ENSG00000202254.1     .      +        Y_RNA            92         0            0
3      1  100295021 100295093 ENSG00000252226.1     .      +   AL451051.1            73         0            0

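My final goal is: A) merge all the tsv files, but keep only two specific columns, namely gene_id and num_reads.

B) After merging, reshape the resulting data frame so that the tsv file names become the column names, gene_id becomes the row names, and the actual content is the numeric column num_reads.

An "artificial" example of the desired final output (taken from another example) would be the following data frame/matrix:

The column names are the relative names of my tsv files, the values are the num_reads column for all rows, and the row names are the gene_id values, i.e. the second column I want to isolate: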

head(assay(rse_gene))
                   SRR2079883 SRR2079884 SRR2079882 SRR2079885 SRR2079881 SRR2079880
ENSG00000000003.14     168731     180764     153611     171413     178689     163379
ENSG00000000005.5        1035       2828       1200       3059        676       1146
ENSG00000000419.12      59444      56188      84757      57178     103568      87674
ENSG00000000457.13      89775      89363     105319      84121     108518     102589
ENSG00000000460.16      51868      55312     153095      58828     154572     147016
ENSG00000000938.12        539        606        516        407       1337        624

I initially tried the following:

library(readr)
df <- list.files(full.names=T)%>% 
  lapply(read_tsv)%>%
  bind_rows

And also:

library(tidyverse);library(data.table)

listGeneFiles <- list.files(".",pattern=".tsv",full.names = TRUE)

dt.gene <- map(listGeneFiles, ~fread(.x, select=c(4,8))) %>%
reduce(left_join)

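But neither approach achieved the desired result. In addition, one potential issue might be that the name of the first column contains a # character, as you can see.

Any suggestions or ideas would be greatly appreciated!

Best,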

Efstathios

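【Question Comments】:

So what is the expected result? Will all the files look like this? It would be easier to understand if you included more than one.

Tags: r csv merge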

【Solution 1】:

Try this:

library(tidyverse) # provides read_tsv, %>%, mutate, reduce, select and pivot_wider

# function to read in a tsv and add the file name as a column
customized_read_tsv <- function(file){
    read_tsv(file) %>%
        mutate(fileName = file)
}

list.files(full.names = TRUE) %>% # list all the files
    lapply(customized_read_tsv) %>% # read them all in with our custom function
    reduce(bind_rows) %>% # stack them all on top of each other
    select(gene_id, fileName, num_reads) %>% # select the correct columns
    pivot_wider(names_from = fileName, values_from = num_reads) # and switch from "long format" to "wide format"
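The pipeline above returns a tibble in which gene_id is an ordinary column. If a numeric matrix with gene_id as row names is wanted, as in the assay() example from the question, one possible follow-up step is sketched here. The data is a made-up stand-in for the stacked long-format table; the sample names sampleA and sampleB and the counts are invented for illustration only.

```r
library(tidyverse)

# Toy stand-in for the stacked long-format data (file names and counts are invented)
long <- tibble(
  gene_id   = rep(c("ENSG00000201491.1", "ENSG00000202254.1"), times = 2),
  fileName  = rep(c("sampleA", "sampleB"), each = 2),
  num_reads = c(0, 5, 3, 7)
)

# Same reshaping step as in the answer: one column per file
wide <- pivot_wider(long, names_from = fileName, values_from = num_reads)

# Move gene_id into the row names, then convert to a numeric matrix
counts <- wide %>%
  column_to_rownames("gene_id") %>%
  as.matrix()

counts
```

Tibbles do not keep row names, which is why the conversion goes through column_to_rownames() (from the tibble package) before as.matrix().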

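One addition: this reads in all the files in the working directory, which often contains other files as well, such as the R script you are actually running or an Rproj file. I suggest putting the tsv files in a sub-directory and then doing something like list.files(path = "~/sourceFiles", full.names = TRUE).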
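A minimal sketch of that sub-directory variant follows, under a few assumptions: the directory is a temporary one created only for the demonstration, the two files sampleA.tsv and sampleB.tsv and their contents are invented, and the column labels are taken from the file names with the directory and extension stripped. It also restricts list.files() to names ending in .tsv and keeps a `#chrom` column in the toy files, since read_tsv() does not treat # as a comment marker unless its comment argument is set.

```r
library(tidyverse)

# Hypothetical setup: write two tiny tsv files into a temporary sub-directory
src_dir <- file.path(tempdir(), "sourceFiles")
dir.create(src_dir, showWarnings = FALSE)
toy <- tibble(
  `#chrom`  = c(1, 1),
  gene_id   = c("ENSG00000201491.1", "ENSG00000202254.1"),
  num_reads = c(0, 5)
)
write_tsv(toy, file.path(src_dir, "sampleA.tsv"))
write_tsv(mutate(toy, num_reads = num_reads + 2), file.path(src_dir, "sampleB.tsv"))

# List only files whose name ends in .tsv (pattern is a regular expression)
tsv_files <- list.files(src_dir, pattern = "\\.tsv$", full.names = TRUE)

# Name each path by its bare file name so that .id below yields clean labels
names(tsv_files) <- tools::file_path_sans_ext(basename(tsv_files))

wide <- tsv_files %>%
  map(~ read_tsv(.x, col_types = cols()) %>% select(gene_id, num_reads)) %>%
  bind_rows(.id = "fileName") %>%   # stack, recording which file each row came from
  pivot_wider(names_from = fileName, values_from = num_reads)

wide
```

Naming the vector of paths before mapping over it lets bind_rows(.id = "fileName") record the source of each row, so no custom reader function is needed in this variant.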

【Comments】: