【Posted】: 2020-09-14 15:42:34
【Question】:
I have a .gtf file with several thousand lines like the following:
sequence1 A transcript 21056 21562 1000 - . gene_id "STRG.3"; transcript_id "DABLNBNP_00019"; ref_gene_name "C_1"; cov "6.923077"; FPKM "28.676970"; TPM "100.721863";
sequence1 A transcript 22861 23949 1000 + . gene_id "STRG.12"; transcript_id "DABLNBNP_00021"; cov "0.456382"; FPKM "1.890439"; TPM "6.639771";
sequence1 B transcript 23990 24547 . + . transcript_id "DABLNBNP_00011"; ref_gene_name "AB"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
sequence1 B transcript 25725 26642 . + . transcript_id "DABLNBNP_00012"; ref_gene_name "BC"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
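The last column contains semicolon-separated key/value pairs. How can I split it into separate columns (gene_id, transcript_id, ref_gene_name, cov, FPKM, TPM)? Not every row has a "gene_id" or a "ref_gene_name". If I simply split the column with the separate() function from tidyr in R, the split is positional, so the values in rows with a missing key shift into the wrong columns: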
# Load packages
library(tidyr)
# Make data frame
a = c("sequence1", "sequence1", "sequence1", "sequence1")
b = c("A", "A", "B", "B")
c = c("transcript", "transcript", "transcript", "transcript")
d = c(21056, 22861, 23990, 25725)
e = c(21562, 23949, 24547, 26642)
f = c(1000, 1000, ".", ".")
g = c("-", "+", "+", "+")
h = c(".", ".", ".", ".")
i = c("gene_id STRG.3; transcript_id DABLNBNP_00019; ref_gene_name C_1; cov 6.923077; FPKM 28.676970; TPM 100.721863;",
      "gene_id STRG.12; transcript_id DABLNBNP_00021; cov 0.456382; FPKM 1.890439; TPM 6.639771;",
      "transcript_id DABLNBNP_00011; ref_gene_name AB; cov 0.0; FPKM 0.000000; TPM 0.000000;",
      "transcript_id DABLNBNP_00012; ref_gene_name BC; cov 0.0; FPKM 0.000000; TPM 0.000000;")
dataset <- data.frame(a, b, c, d, e, f, g, h, i, stringsAsFactors = FALSE)
# Split last column
dataset_split <- separate(dataset, i,
                          into = c("gene_id", "transcript_id",
                                   "ref_gene_name", "cov",
                                   "FPKM", "TPM"),
                          sep = ";")
Does anyone know how to solve this?
Thanks a lot!
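For reference, here is a minimal sketch of the result I am after: each value is looked up by its key name rather than by its position, so a missing gene_id or ref_gene_name becomes NA instead of shifting the row. It uses base R only. The names parse_attributes, wanted, attrs and attr_df are hypothetical ones made up for illustration, and the two attribute strings are copied from the sample rows above. It may well not be the best approach.

```r
# Hypothetical helper: turn one attribute string into a named character vector.
parse_attributes <- function(x) {
  # Split on ";" and drop empty pieces (e.g. after the trailing semicolon)
  fields <- trimws(strsplit(x, ";", fixed = TRUE)[[1]])
  fields <- fields[nzchar(fields)]
  # The key is the text before the first whitespace, the value is the rest
  keys <- sub("[[:space:]].*$", "", fields)
  values <- sub("^[^[:space:]]+[[:space:]]+", "", fields)
  # Strip the double quotes that values carry in the real GTF file
  values <- gsub('"', "", values, fixed = TRUE)
  setNames(values, keys)
}

# Columns to extract (the keys listed in the question)
wanted <- c("gene_id", "transcript_id", "ref_gene_name", "cov", "FPKM", "TPM")

# Two attribute strings taken from the sample rows; the second has no gene_id
attrs <- c(
  'gene_id "STRG.3"; transcript_id "DABLNBNP_00019"; ref_gene_name "C_1"; cov "6.923077"; FPKM "28.676970"; TPM "100.721863";',
  'transcript_id "DABLNBNP_00011"; ref_gene_name "AB"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";'
)

# Index each parsed vector by name; keys that are absent come back as NA
parsed <- t(vapply(attrs,
                   function(x) unname(parse_attributes(x)[wanted]),
                   character(length(wanted)),
                   USE.NAMES = FALSE))
colnames(parsed) <- wanted
attr_df <- as.data.frame(parsed, stringsAsFactors = FALSE)
```

All columns in attr_df are character at this point, so cov, FPKM and TPM would still need as.numeric(), and the result would have to be bound back onto the first eight columns of the original table.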
【Discussion】:
Tags: r data-cleaning