【Question title】: How to extract all information into an xls or txt file
【Posted】: 2015-05-24 01:18:26
【Question description】:

I want to extract all the information for a given sample into an xls file. For example:

library(GEOquery)
gpl <- getGEO("GPL16791")
data <- gpl@header$sample_id
gps <- getGEO(data[1])
str(gps)

which prints the following:

Formal class 'GSM' [package "GEOquery"] with 2 slots
  ..@ dataTable:Formal class 'GEODataTable' [package "GEOquery"] with 2 slots
  .. .. ..@ columns:'data.frame':   0 obs. of  0 variables
  .. .. ..@ table  :'data.frame':   0 obs. of  0 variables
  ..@ header   :List of 36
  .. ..$ channel_count          : chr "1"
  .. ..$ characteristics_ch1    : chr "cell type: Induced endothelial cells from cultured foreskin fibroblast cells (Stegment)"
  .. ..$ contact_address        : chr "3333 Burnet Ave"
  .. ..$ contact_city           : chr "Cincinnati"
  .. ..$ contact_country        : chr "USA"
  .. ..$ contact_department     : chr "Biomedical Informatics"
  .. ..$ contact_email          : chr "Rebekah.Karns@cchmc.org"
  .. ..$ contact_institute      : chr "Cincinnati Children's Hospital Medical Center"
  .. ..$ contact_laboratory     : chr "Bruce Aronow, PhD"
  .. ..$ contact_name           : chr "Rebekah,,Karns"
  .. ..$ contact_state          : chr "OH"
  .. ..$ contact_zip/postal_code: chr "45276"
  .. ..$ data_processing        : chr [1:4] "Trimmed sequences were generated as fastq outputs and analyzed based on the TopHat/Cufflinks pipeline based on reference annota"| __truncated__ "Gene-level expression was normalized and baselined to the 80th percentile of that sample's overall expression in GeneSpring v7."| __truncated__ "Genome_build: GRCh37/hg19" "Supplementary_files_format_and_content: Each sample has a corresponding .txt file with normalized FPKM"
  .. ..$ data_row_count         : chr "0"
  .. ..$ description            : chr "iECa"
  .. ..$ extract_protocol_ch1   : chr [1:2] "Using RNeasy Mini Kit (Qiagen), total RNA was extracted and quantitative polymerase chain reaction was performed using Taqman g"| __truncated__ "RNA-Seq–based expression analysis was carried out using RNA samples converted into individual cDNA libraries using Illumina (Sa"| __truncated__
  .. ..$ geo_accession          : chr "GSM1098572"
  .. ..$ growth_protocol_ch1    : chr "Fibroblasts were treated with Poly I:C (30ng/ml) and the medium changed to DMEM with 7.5% FBS and 7.5% knockout serum replaceme"| __truncated__
  .. ..$ instrument_model       : chr "Illumina HiSeq 2500"
  .. ..$ last_update_date       : chr "Apr 18 2013"
  .. ..$ library_selection      : chr "cDNA"
  .. ..$ library_source         : chr "transcriptomic"
  .. ..$ library_strategy       : chr "RNA-Seq"
  .. ..$ molecule_ch1           : chr "total RNA"
  .. ..$ organism_ch1           : chr "Homo sapiens"
  .. ..$ platform_id            : chr "GPL16791"
  .. ..$ relation               : chr [1:2] "SRA: http://www.ncbi.nlm.nih.gov/sra?term=SRX249507" "BioSample: http://www.ncbi.nlm.nih.gov/biosample/SAMN01978505"
  .. ..$ series_id              : chr "GSE45176"
  .. ..$ source_name_ch1        : chr "Induced endothelial cell"
  .. ..$ status                 : chr "Public on Apr 14 2013"
  .. ..$ submission_date        : chr "Mar 14 2013"
  .. ..$ supplementary_file_1   : chr "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1098nnn/GSM1098572/suppl/GSM1098572_iECa_Processed.txt.gz"
  .. ..$ supplementary_file_2   : chr "ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX249/SRX249507"
  .. ..$ taxid_ch1              : chr "9606"
  .. ..$ title                  : chr "iEC: Rep1"
  .. ..$ type                   : chr "SRA"

I would like output in txt or xls format in which each row is one sample from `data` and the columns hold all of this information, for example:

   channel_count    characteristics_ch1                contact_address .....
1   "1"           "cell type: Induced endothelial cells ..."       "3333 Burnet Ave"
2
.
.
.
until length of data

【Comments on the question】:

    Tags: r bioinformatics bioconductor


    【Solution 1】:

    This function now also works when a header is missing some variables. I know the loops are not very elegant, but it worked in my tests.

    library(GEOquery)

    gpl <- getGEO("GPL18448")
    data <- gpl@header$sample_id

    # Download one sample and return its header as a one-row character matrix
    getGpsInfo <- function(x){
          gps <- getGEO(x)
          # multi-valued fields are flattened to name1, name2, ...
          gps <- unlist(gps@header)
          gps <- data.frame(gps, stringsAsFactors = FALSE)
          gps <- t(gps)
          # if gps has multiple rows keep only unique ones
          gps <- unique(gps)
          return(gps)
    }
    dat <- lapply(data, FUN = getGpsInfo)
    # dat is a list with different numbers of elements per entry
    varnames <- unique(unlist(lapply(dat, colnames)))
    dat2 <- data.frame(matrix(NA, nrow = length(dat), ncol = length(varnames)))
    colnames(dat2) <- varnames
    for(i in seq_along(dat)){
          for(j in seq_along(varnames)){
                # match() returns the first matching column, or NA if absent
                element <- match(varnames[j], colnames(dat[[i]]))
                if (!is.na(element)){
                      dat2[i,j] <- dat[[i]][1, element]
                }
          }
    }
    # semicolon-separated text; missing header fields stay NA
    write.table(dat2, file = "dat2.csv", row.names = TRUE, sep = ";")
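
As an alternative to the nested loops, here is a minimal sketch (not the answerer's code) that collapses each multi-valued header field into a single string and fills a matrix by column name. The `headers` list is a hypothetical stand-in for the `@header` slots that `getGEO()` returns, so the example runs without network access. The helper name `header_to_row`, the " | " separator, and the output file name are assumptions made for illustration.

```r
# Hypothetical stand-ins for gps@header from two samples (made-up values).
headers <- list(
  list(geo_accession = "GSM0000001", channel_count = "1",
       data_processing = c("step one", "step two")),
  list(geo_accession = "GSM0000002", channel_count = "1",
       description = "only this sample has a description")
)

# Collapse every header field to one string so that each field maps to
# exactly one column, even when it holds several values.
header_to_row <- function(header, sep = " | ") {
  vapply(header, function(field) paste(field, collapse = sep), character(1))
}

rows <- lapply(headers, header_to_row)

# Union of all field names, in order of first appearance.
varnames <- unique(unlist(lapply(rows, names)))

# Character matrix pre-filled with NA; each row is filled by column name,
# so fields a sample lacks simply stay NA.
info <- matrix(NA_character_, nrow = length(rows), ncol = length(varnames),
               dimnames = list(NULL, varnames))
for (i in seq_along(rows)) {
  info[i, names(rows[[i]])] <- rows[[i]]
}
info <- as.data.frame(info, stringsAsFactors = FALSE)

# Write tab-delimited text; NA cells are written as empty fields.
out_file <- file.path(tempdir(), "sample_info.txt")
write.table(info, file = out_file, sep = "\t", row.names = FALSE, na = "")
```

With real data, `headers` could presumably be built as `lapply(data, function(x) getGEO(x)@header)`. Tab-delimited text can be opened directly in Excel, which may be enough if a true xls file is not strictly required.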
    

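    【Comments】:

    • I get an error like: Error in file.exists(destfile) : object 'destfile' not found
    • I tried a different GPL library
    • I can't reproduce the destfile error. Could you try a different working directory? I edited the function to keep only unique rows, because I noticed that the rows often differ slightly (you can decide which rows to keep by editing that line of code). I don't fully understand your second approach, but the first one should work.
    • I tried it and it works, but it doesn't work for any other library. Why? I've edited my question; could you take another look? Thanks a lot, I'm really stuck on this.
    • I see, by "library" you mean a different GPL code, right? The problem is that the outputs have different columns/variables, so merging doesn't work. I'm going to edit my answer.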
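
The column mismatch described in the last comment can be sketched with two hypothetical one-row tables. The accessions and values are made up, and `pad_columns` is an illustrative helper, not a GEOquery function. The sketch shows why a plain `rbind()` fails when samples carry different header fields, and one way to pad the missing columns with NA before binding.

```r
# Hypothetical one-row tables for two samples whose header fields differ.
a <- data.frame(geo_accession = "GSM0000001", description = "iECa",
                stringsAsFactors = FALSE)
b <- data.frame(geo_accession = "GSM0000002", title = "Rep1",
                stringsAsFactors = FALSE)

# rbind() on data frames requires matching column names, so this call fails;
# the error is caught here so that the script keeps running.
result <- tryCatch(rbind(a, b), error = function(e) "rbind failed")

# Add the columns a table lacks as NA, then return them in a common order.
pad_columns <- function(df, columns) {
  for (missing_col in setdiff(columns, names(df))) {
    df[[missing_col]] <- NA_character_
  }
  df[, columns, drop = FALSE]
}

all_columns <- union(names(a), names(b))
combined <- rbind(pad_columns(a, all_columns), pad_columns(b, all_columns))
```

After padding, both tables share the same columns in the same order, so `rbind()` succeeds and the fields a sample lacks appear as NA.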