How to read a vcf file in R: answers

【Question title】: How to read vcf file in R
【Posted】: 2015-12-07 10:37:45
【Question】:

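I have this VCF format file and I want to read it in R. However, the file contains some redundant lines that I want to skip. I want the result to start at the line that begins with #CHROM.

Here is what I tried: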

    chromo1 <- try(scan(myfile.vcf, what = character(), n = 5000, sep = "\n", skip = 0, fill = TRUE, na.strings = "", quote = "\""))  ## find the start of the vcf file
    skip.lines <- grep("^#CHROM", chromo1)

    column.labels <- read.delim(myfile.vcf, header = FALSE, nrows = 1, skip = (skip.lines - 1), sep = "\t", fill = TRUE, stringsAsFactors = FALSE, na.strings = "", quote = "\"")
    num.vars <- dim(column.labels)[2]

myfile.vcf

    #not wanted line
    #unnecessary line
    #junk line
    #CHROM  POS     ID      REF     ALT
    11      33443   3        A       T
    12      33445   5        A       G

Desired result

    #CHROM  POS     ID      REF     ALT
    11      33443   3        A       T
    12      33445   5        A       G

【Question comments】:

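How about using a sequencing package? If you google "read vcf R", several come up.
Bioconductor has several VCF readers.
@RichardScriven vcfreader does not fit my case. I just want to skip those lines and get the tab-delimited table.
Possible duplicate of Extract sample data from VCF files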

Tags: r bioinformatics genetics vcf-variant-call-format

【Solution 1】:

Maybe this will work for you:

    # Read the vcf file twice: first for the column names, then for the data
    tmp_vcf <- readLines("test.vcf")
    # read.table() skips lines starting with "#" by default (comment.char = "#")
    tmp_vcf_data <- read.table("test.vcf", stringsAsFactors = FALSE)

    # Keep the lines up to and including the #CHROM line, then split that
    # last line on tabs to get the column names
    tmp_vcf <- tmp_vcf[seq_len(grep("^#CHROM", tmp_vcf)[1])]
    vcf_names <- unlist(strsplit(tmp_vcf[length(tmp_vcf)], "\t"))
    names(tmp_vcf_data) <- vcf_names

P.S.: if you have several vcf files, you should use the lapply function.

Best, Robert
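
As an illustration of the lapply suggestion, here is a minimal sketch in base R. The helper name `read_vcf_table` and the demo files are hypothetical, not part of the answer above. It assumes each file has exactly one tab-delimited header line starting with #CHROM, and it reads every column as character (so that an ALT column containing only "T" is not converted to logical) before converting POS to integer.

```r
# Hypothetical helper (not from the answer above): read one vcf file into a
# data frame, using the #CHROM line as the header row.
read_vcf_table <- function(path) {
  lines <- readLines(path)
  header_idx <- grep("^#CHROM", lines)[1]
  if (is.na(header_idx)) stop("no #CHROM header line found in ", path)
  vcf <- read.delim(
    path,
    skip = header_idx - 1,      # drop every line before the #CHROM line
    header = TRUE,
    check.names = FALSE,        # keep the column name "#CHROM" unchanged
    colClasses = "character",   # avoid "T"/"F" alleles being read as logical
    stringsAsFactors = FALSE
  )
  vcf$POS <- as.integer(vcf$POS)
  vcf
}

# Made-up demo data mirroring the layout shown in the question
demo_lines <- c(
  "#not wanted line",
  "#unnecessary line",
  "#junk line",
  "#CHROM\tPOS\tID\tREF\tALT",
  "11\t33443\t3\tA\tT",
  "12\t33445\t5\tA\tG"
)
vcf_files <- c(tempfile(fileext = ".vcf"), tempfile(fileext = ".vcf"))
for (f in vcf_files) writeLines(demo_lines, f)

# One data frame per file, named after the file it came from
vcf_list <- lapply(vcf_files, read_vcf_table)
names(vcf_list) <- basename(vcf_files)
print(vcf_list[[1]])
```

If all files share the same columns, `do.call(rbind, vcf_list)` would stack them into a single data frame.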

【Comments】:

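Nice answer, but do you always use dots in variable names? I find it confusing (especially if you also know Python) and prefer underscores. I guess it is a matter of taste, cheers.
@RicardoGuerreiro Dots in variable names are idiomatic in R. They are widely used and perfectly acceptable.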

【Solution 2】:

data.table::fread reads the file as expected; see the example:

    library(data.table)

    # Try this example vcf from GitHub
    vcf <- fread("https://raw.githubusercontent.com/vcflib/vcflib/master/samples/sample.vcf")

    # Or, if the file is local:
    vcf <- fread("path/to/my/vcf/sample.vcf")
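
If fread's automatic header detection does not land on the right line for a particular file, its `skip` argument also accepts a string, and reading then starts at the first line containing it. A minimal sketch, assuming the data.table package is installed; the demo file below is made up to mirror the question, and renaming `#CHROM` is an optional convenience, not something the answer requires.

```r
library(data.table)

# Made-up demo file mirroring the layout shown in the question
demo_file <- tempfile(fileext = ".vcf")
writeLines(c(
  "#not wanted line",
  "#unnecessary line",
  "#junk line",
  "#CHROM\tPOS\tID\tREF\tALT",
  "11\t33443\t3\tA\tT",
  "12\t33445\t5\tA\tG"
), demo_file)

# skip = "#CHROM" starts reading at the first line that contains "#CHROM";
# REF and ALT are forced to character so alleles are never type-guessed
vcf <- fread(
  demo_file,
  skip = "#CHROM",
  sep = "\t",
  header = TRUE,
  colClasses = list(character = c("REF", "ALT"))
)

# Optional: drop the leading "#" from the first column name
setnames(vcf, "#CHROM", "CHROM")
print(vcf)
```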

We can also use the vcfR package; see the manual at the link.

【Comments】: