【Question title】: Merge multiple tables (identical headers) within a text file
【Posted】: 2022-01-09 02:20:32
【Question】:

Suppose I have more than 200 files, each structured like this:

# Peptide length 11
# Rank Threshold for Strong binding peptides   0.500
# Rank Threshold for Weak binding peptides   2.000
-----------------------------------------------------------------------------------
  pos          HLA      peptide         Core Offset  I_pos  I_len  D_pos  D_len        iCore        Identity 1-log50k(aff) Affinity(nM)    %Rank  BindLevel
-----------------------------------------------------------------------------------
    0    HLA-B4402  GSHDLGIILQK    GSHDLGIIL      0      0      0      0      0    GSHDLGIIL NM_000094_3_COL         0.015     42580.79    90.00
    1    HLA-B4402  SHDLGIILQKI    SLGIILQKI      0      0      0      1      2  SHDLGIILQKI NM_000094_3_COL         0.024     38731.55    65.00
    2    HLA-B4402  HDLGIILQKIR    HDLIILQKI      0      0      0      3      1   HDLGIILQKI NM_000094_3_COL         0.024     38400.24    65.00
    3    HLA-B4402  DLGIILQKIRD    DLGIILQKI      0      0      0      0      0    DLGIILQKI NM_000094_3_COL         0.011     44267.78    95.00
    4    HLA-B4402  LGIILQKIRDM    LGIILQRDM      0      0      0      6      2  LGIILQKIRDM NM_000094_3_COL         0.024     38411.46    65.00
    5    HLA-B4402  GIILQKIRDMP    GIILQIRDM      0      0      0      5      1   GIILQKIRDM NM_000094_3_COL         0.017     41463.75    80.00
    6    HLA-B4402  IILQKIRDMPY    IILQKIRDY      0      0      0      8      2  IILQKIRDMPY NM_000094_3_COL         0.025     38152.18    65.00
    7    HLA-B4402  ILQKIRDMPYM    ILQKIRMPY      0      0      0      6      1   ILQKIRDMPY NM_000094_3_COL         0.025     37993.98    60.00
    8    HLA-B4402  LQKIRDMPYMD    QKIRDMPYM      1      0      0      0      0    QKIRDMPYM NM_000094_3_COL         0.015     42595.54    90.00
    9    HLA-B4402  QKIRDMPYMDP    QKIRDMPYM      0      0      0      0      0    QKIRDMPYM NM_000094_3_COL         0.017     41645.82    85.00
   10    HLA-B4402  KIRDMPYMDPS    KDMPYMDPS      0      0      0      1      2  KIRDMPYMDPS NM_000094_3_COL         0.023     39039.53    70.00
   11    HLA-B4402  IRDMPYMDPSX    RDMPYMPSX      1      0      0      6      1   RDMPYMDPSX NM_000094_3_COL         0.036     33871.57    41.00
-----------------------------------------------------------------------------------

Protein NM_000094_3_COL. Allele HLA-B4402. Number of high binders 0. Number of weak binders 0. Number of peptides 12

-----------------------------------------------------------------------------------
# Rank Threshold for Strong binding peptides   0.500
# Rank Threshold for Weak binding peptides   2.000
-----------------------------------------------------------------------------------
  pos          HLA      peptide         Core Offset  I_pos  I_len  D_pos  D_len        iCore        Identity 1-log50k(aff) Affinity(nM)    %Rank  BindLevel
-----------------------------------------------------------------------------------
    0    HLA-B4402  PVTGYKVQYTS    TGYKVQYTS      2      0      0      0      0    TGYKVQYTS NM_000094_3_COL         0.011     44190.25    95.00
    1    HLA-B4402  VTGYKVQYTSL    VTGYQYTSL      0      0      0      4      2  VTGYKVQYTSL NM_000094_3_COL         0.020     40061.36    75.00
    2    HLA-B4402  TGYKVQYTSLT    TGYKVYTSL      0      0      0      5      1   TGYKVQYTSL NM_000094_3_COL         0.020     40487.08    75.00
    3    HLA-B4402  GYKVQYTSLTG    YVQYTSLTG      1      0      0      1      1   YKVQYTSLTG NM_000094_3_COL         0.017     41521.20    80.00
    4    HLA-B4402  YKVQYTSLTGL    YQYTSLTGL      0      0      0      1      2  YKVQYTSLTGL NM_000094_3_COL         0.031     35710.76    49.00
    5    HLA-B4402  KVQYTSLTGLG    KVQYTSLTL      0      0      0      8      1   KVQYTSLTGL NM_000094_3_COL         0.029     36392.20    55.00
    6    HLA-B4402  VQYTSLTGLGQ    VQYTSLTGL      0      0      0      0      0    VQYTSLTGL NM_000094_3_COL         0.016     42180.50    85.00
    7    HLA-B4402  QYTSLTGLGQP    QYTSLTGLG      0      0      0      0      0    QYTSLTGLG NM_000094_3_COL         0.011     44293.17    95.00
    8    HLA-B4402  YTSLTGLGQPL    YTSLLGQPL      0      0      0      4      2  YTSLTGLGQPL NM_000094_3_COL         0.034     34547.04    44.00
    9    HLA-B4402  TSLTGLGQPLP    SLTGLGQPL      1      0      0      0      0    SLTGLGQPL NM_000094_3_COL         0.024     38475.10    65.00
   10    HLA-B4402  SLTGLGQPLPS    SLTGLGQPL      0      0      0      0      0    SLTGLGQPL NM_000094_3_COL         0.026     37575.76    60.00
   11    HLA-B4402  LTGLGQPLPSX    LLGQPLPSX      0      0      0      1      2  LTGLGQPLPSX NM_000094_3_COL         0.014     42874.84    90.00
-----------------------------------------------------------------------------------

Protein NM_000094_3_COL. Allele HLA-B4402. Number of high binders 0. Number of weak binders 0. Number of peptides 12

-----------------------------------------------------------------------------------
# Rank Threshold for Strong binding peptides   0.500
# Rank Threshold for Weak binding peptides   2.000
-----------------------------------------------------------------------------------
  pos          HLA      peptide         Core Offset  I_pos  I_len  D_pos  D_len        iCore        Identity 1-log50k(aff) Affinity(nM)    %Rank  BindLevel
-----------------------------------------------------------------------------------
    0    HLA-B4402  FLRLLDLAQEE    RLLDLAQEE      2      0      0      0      0    RLLDLAQEE NM_000106_5_CYP         0.014     42841.45    90.00
    1    HLA-B4402  LRLLDLAQEEL    RLLDLAQEL      1      0      0      7      1   RLLDLAQEEL NM_000106_5_CYP         0.029     36648.25    55.00
    2    HLA-B4402  RLLDLAQEELK    RLLDLAQEL      0      0      0      7      1   RLLDLAQEEL NM_000106_5_CYP         0.029     36350.87    55.00
    3    HLA-B4402  LLDLAQEELKE    LLDLAQEEL      0      0      0      0      0    LLDLAQEEL NM_000106_5_CYP         0.013     43487.79    95.00
    4    HLA-B4402  LDLAQEELKEE    LDQEELKEE      0      0      0      2      2  LDLAQEELKEE NM_000106_5_CYP         0.008     45629.40    99.00
    5    HLA-B4402  DLAQEELKEES    AQEELKEES      2      0      0      0      0    AQEELKEES NM_000106_5_CYP         0.009     45287.57    99.00
    6    HLA-B4402  LAQEELKEESG    AEELKEESG      1      0      0      1      1   AQEELKEESG NM_000106_5_CYP         0.013     43568.32    95.00
    7    HLA-B4402  AQEELKEESGF    AELKEESGF      0      0      0      1      2  AQEELKEESGF NM_000106_5_CYP         0.231      4113.65     2.50
    8    HLA-B4402  QEELKEESGFL    QELKEESGF      0      0      0      1      1   QEELKEESGF NM_000106_5_CYP         0.123     13202.71     6.00
    9    HLA-B4402  EELKEESGFLR    EELKEESGF      0      0      0      0      0    EELKEESGF NM_000106_5_CYP         0.076     21904.46    13.00
   10    HLA-B4402  ELKEESGFLRE    ELKEESGFL      0      0      0      0      0    ELKEESGFL NM_000106_5_CYP         0.030     36301.74    55.00
   11    HLA-B4402  LKEESGFLREX    KEESFLREX      1      0      0      4      1   KEESGFLREX NM_000106_5_CYP         0.060     26205.35    19.00
-----------------------------------------------------------------------------------

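As you can see, each file is basically a series of tables (all with the same header) with text between them. I want to keep only the tables and, if possible, drop the dashed lines, so that only the data rows (and the header) remain, with fields separated by \t.

The ideal result would look like this: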

pos          HLA      peptide         Core Offset  I_pos  I_len  D_pos  D_len        iCore        Identity 1-log50k(aff) Affinity(nM)    %Rank  BindLevel
    0    HLA-B4402  GSHDLGIILQK    GSHDLGIIL      0      0      0      0      0    GSHDLGIIL NM_000094_3_COL         0.015     42580.79    90.00
    1    HLA-B4402  SHDLGIILQKI    SLGIILQKI      0      0      0      1      2  SHDLGIILQKI NM_000094_3_COL         0.024     38731.55    65.00
    2    HLA-B4402  HDLGIILQKIR    HDLIILQKI      0      0      0      3      1   HDLGIILQKI NM_000094_3_COL         0.024     38400.24    65.00
    3    HLA-B4402  DLGIILQKIRD    DLGIILQKI      0      0      0      0      0    DLGIILQKI NM_000094_3_COL         0.011     44267.78    95.00
    4    HLA-B4402  LGIILQKIRDM    LGIILQRDM      0      0      0      6      2  LGIILQKIRDM NM_000094_3_COL         0.024     38411.46    65.00
    5    HLA-B4402  GIILQKIRDMP    GIILQIRDM      0      0      0      5      1   GIILQKIRDM NM_000094_3_COL         0.017     41463.75    80.00
    6    HLA-B4402  IILQKIRDMPY    IILQKIRDY      0      0      0      8      2  IILQKIRDMPY NM_000094_3_COL         0.025     38152.18    65.00
    7    HLA-B4402  ILQKIRDMPYM    ILQKIRMPY      0      0      0      6      1   ILQKIRDMPY NM_000094_3_COL         0.025     37993.98    60.00
    8    HLA-B4402  LQKIRDMPYMD    QKIRDMPYM      1      0      0      0      0    QKIRDMPYM NM_000094_3_COL         0.015     42595.54    90.00
    9    HLA-B4402  QKIRDMPYMDP    QKIRDMPYM      0      0      0      0      0    QKIRDMPYM NM_000094_3_COL         0.017     41645.82    85.00
   10    HLA-B4402  KIRDMPYMDPS    KDMPYMDPS      0      0      0      1      2  KIRDMPYMDPS NM_000094_3_COL         0.023     39039.53    70.00
   11    HLA-B4402  IRDMPYMDPSX    RDMPYMPSX      1      0      0      6      1   RDMPYMDPSX NM_000094_3_COL         0.036     33871.57    41.00
    0    HLA-B4402  PVTGYKVQYTS    TGYKVQYTS      2      0      0      0      0    TGYKVQYTS NM_000094_3_COL         0.011     44190.25    95.00
    1    HLA-B4402  VTGYKVQYTSL    VTGYQYTSL      0      0      0      4      2  VTGYKVQYTSL NM_000094_3_COL         0.020     40061.36    75.00
    2    HLA-B4402  TGYKVQYTSLT    TGYKVYTSL      0      0      0      5      1   TGYKVQYTSL NM_000094_3_COL         0.020     40487.08    75.00
    3    HLA-B4402  GYKVQYTSLTG    YVQYTSLTG      1      0      0      1      1   YKVQYTSLTG NM_000094_3_COL         0.017     41521.20    80.00
    4    HLA-B4402  YKVQYTSLTGL    YQYTSLTGL      0      0      0      1      2  YKVQYTSLTGL NM_000094_3_COL         0.031     35710.76    49.00
    5    HLA-B4402  KVQYTSLTGLG    KVQYTSLTL      0      0      0      8      1   KVQYTSLTGL NM_000094_3_COL         0.029     36392.20    55.00
    6    HLA-B4402  VQYTSLTGLGQ    VQYTSLTGL      0      0      0      0      0    VQYTSLTGL NM_000094_3_COL         0.016     42180.50    85.00
    7    HLA-B4402  QYTSLTGLGQP    QYTSLTGLG      0      0      0      0      0    QYTSLTGLG NM_000094_3_COL         0.011     44293.17    95.00
    8    HLA-B4402  YTSLTGLGQPL    YTSLLGQPL      0      0      0      4      2  YTSLTGLGQPL NM_000094_3_COL         0.034     34547.04    44.00
    9    HLA-B4402  TSLTGLGQPLP    SLTGLGQPL      1      0      0      0      0    SLTGLGQPL NM_000094_3_COL         0.024     38475.10    65.00
   10    HLA-B4402  SLTGLGQPLPS    SLTGLGQPL      0      0      0      0      0    SLTGLGQPL NM_000094_3_COL         0.026     37575.76    60.00
   11    HLA-B4402  LTGLGQPLPSX    LLGQPLPSX      0      0      0      1      2  LTGLGQPLPSX NM_000094_3_COL         0.014     42874.84    90.00
    0    HLA-B4402  FLRLLDLAQEE    RLLDLAQEE      2      0      0      0      0    RLLDLAQEE NM_000106_5_CYP         0.014     42841.45    90.00
    1    HLA-B4402  LRLLDLAQEEL    RLLDLAQEL      1      0      0      7      1   RLLDLAQEEL NM_000106_5_CYP         0.029     36648.25    55.00
    2    HLA-B4402  RLLDLAQEELK    RLLDLAQEL      0      0      0      7      1   RLLDLAQEEL NM_000106_5_CYP         0.029     36350.87    55.00
    3    HLA-B4402  LLDLAQEELKE    LLDLAQEEL      0      0      0      0      0    LLDLAQEEL NM_000106_5_CYP         0.013     43487.79    95.00
    4    HLA-B4402  LDLAQEELKEE    LDQEELKEE      0      0      0      2      2  LDLAQEELKEE NM_000106_5_CYP         0.008     45629.40    99.00
    5    HLA-B4402  DLAQEELKEES    AQEELKEES      2      0      0      0      0    AQEELKEES NM_000106_5_CYP         0.009     45287.57    99.00
    6    HLA-B4402  LAQEELKEESG    AEELKEESG      1      0      0      1      1   AQEELKEESG NM_000106_5_CYP         0.013     43568.32    95.00
    7    HLA-B4402  AQEELKEESGF    AELKEESGF      0      0      0      1      2  AQEELKEESGF NM_000106_5_CYP         0.231      4113.65     2.50
    8    HLA-B4402  QEELKEESGFL    QELKEESGF      0      0      0      1      1   QEELKEESGF NM_000106_5_CYP         0.123     13202.71     6.00
    9    HLA-B4402  EELKEESGFLR    EELKEESGF      0      0      0      0      0    EELKEESGF NM_000106_5_CYP         0.076     21904.46    13.00
   10    HLA-B4402  ELKEESGFLRE    ELKEESGFL      0      0      0      0      0    ELKEESGFL NM_000106_5_CYP         0.030     36301.74    55.00
   11    HLA-B4402  LKEESGFLREX    KEESFLREX      1      0      0      4      1   KEESGFLREX NM_000106_5_CYP         0.060     26205.35    19.00

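This is what I am struggling with:

1. How can I join all the tables within one file into a single table?

2. Is it possible to join all the tables from all the files into a single table?

A way to do this in R would also be fine.

Thank you very much!

PS: I looked through the similar questions section but could not find a solution along these lines.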

【Comments】:

    Tags: r linux dataframe


    【Solution 1】:

    Something like this should work:

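    # Keep only the data rows (lines starting with a digit); skip = 6 alone fails on the dashed and summary lines
    read_rows <- function(f) read.table(text = grep("^\\s*\\d", readLines(f), value = TRUE))
    df_list <- lapply(file_names, read_rows)
    df <- do.call('rbind', df_list)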
    

    Then add your column names at the end.
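A minimal sketch of that last step, assuming the combined data frame has 14 columns because BindLevel is empty in every row, as in the sample from the question. The two sample rows stand in for `df`, and `out_file` is a hypothetical output path; neither comes from the original answer.

```r
# Header names copied from the "pos HLA peptide ..." line in the question
header <- c("pos", "HLA", "peptide", "Core", "Offset", "I_pos", "I_len",
            "D_pos", "D_len", "iCore", "Identity", "1-log50k(aff)",
            "Affinity(nM)", "%Rank", "BindLevel")

# Two sample rows standing in for the combined data frame `df`
rows <- c(
  "    0    HLA-B4402  GSHDLGIILQK    GSHDLGIIL      0      0      0      0      0    GSHDLGIIL NM_000094_3_COL         0.015     42580.79    90.00",
  "    1    HLA-B4402  SHDLGIILQKI    SLGIILQKI      0      0      0      1      2  SHDLGIILQKI NM_000094_3_COL         0.024     38731.55    65.00"
)
df <- read.table(text = rows, stringsAsFactors = FALSE)

# BindLevel is empty in these rows, so only the first ncol(df) names apply
names(df) <- header[seq_len(ncol(df))]

# Write the result tab-separated, without quotes or row names
out_file <- tempfile(fileext = ".tsv")  # hypothetical output path
write.table(df, out_file, sep = "\t", quote = FALSE, row.names = FALSE)
```

If some rows do carry a BindLevel value, the field count per row would differ and this sketch would need adjusting.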

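    【Comments】:

      【Solution 2】:

      This extracts and parses the data from a single file.

      I tried to split the data and add the header, but I am not 100% sure it works correctly: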

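      library(dplyr)
      library(stringr)  # str_detect(), str_split()
      library(tidyr)    # separate()

      # Read the file as a single column of raw text lines
      original_df <-
        as.data.frame(readLines("ProteinData.txt", warn = FALSE))

      colnames(original_df) <- c("Column1")

      # Take the first header line (leading whitespace followed by "pos")
      header <- original_df %>% filter(str_detect(Column1, "^\\s+pos"))

      header <- unlist(str_split(header$Column1[1], "\\s+"))

      # The leading whitespace yields an empty first name; give it a placeholder
      header <- replace(header, header == "" , "Unused")

      # Keep only the data rows, split them on whitespace, drop the placeholder column
      parsed_df <- original_df %>%
        filter(str_detect(Column1, "^\\W+\\d")) %>%
        separate(Column1, header, sep = "\\s+", fill = "right") %>%
        select(!c(1))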
      
      pos HLA peptide Core Offset I_pos I_len D_pos D_len iCore Identity 1-log50k(aff) Affinity(nM) %Rank BindLevel
      0 HLA-B4402 GSHDLGIILQK GSHDLGIIL 0 0 0 0 0 GSHDLGIIL NM_000094_3_COL 0.015 42580.79 90.00 NA
      1 HLA-B4402 SHDLGIILQKI SLGIILQKI 0 0 0 1 2 SHDLGIILQKI NM_000094_3_COL 0.024 38731.55 65.00 NA
      2 HLA-B4402 HDLGIILQKIR HDLIILQKI 0 0 0 3 1 HDLGIILQKI NM_000094_3_COL 0.024 38400.24 65.00 NA
      3 HLA-B4402 DLGIILQKIRD DLGIILQKI 0 0 0 0 0 DLGIILQKI NM_000094_3_COL 0.011 44267.78 95.00 NA
      4 HLA-B4402 LGIILQKIRDM LGIILQRDM 0 0 0 6 2 LGIILQKIRDM NM_000094_3_COL 0.024 38411.46 65.00 NA
      5 HLA-B4402 GIILQKIRDMP GIILQIRDM 0 0 0 5 1 GIILQKIRDM NM_000094_3_COL 0.017 41463.75 80.00 NA
      6 HLA-B4402 IILQKIRDMPY IILQKIRDY 0 0 0 8 2 IILQKIRDMPY NM_000094_3_COL 0.025 38152.18 65.00 NA
      7 HLA-B4402 ILQKIRDMPYM ILQKIRMPY 0 0 0 6 1 ILQKIRDMPY NM_000094_3_COL 0.025 37993.98 60.00 NA
      8 HLA-B4402 LQKIRDMPYMD QKIRDMPYM 1 0 0 0 0 QKIRDMPYM NM_000094_3_COL 0.015 42595.54 90.00 NA
      9 HLA-B4402 QKIRDMPYMDP QKIRDMPYM 0 0 0 0 0 QKIRDMPYM NM_000094_3_COL 0.017 41645.82 85.00 NA
      10 HLA-B4402 KIRDMPYMDPS KDMPYMDPS 0 0 0 1 2 KIRDMPYMDPS NM_000094_3_COL 0.023 39039.53 70.00 NA
      11 HLA-B4402 IRDMPYMDPSX RDMPYMPSX 1 0 0 6 1 RDMPYMDPSX NM_000094_3_COL 0.036 33871.57 41.00 NA
      0 HLA-B4402 PVTGYKVQYTS TGYKVQYTS 2 0 0 0 0 TGYKVQYTS NM_000094_3_COL 0.011 44190.25 95.00 NA
      1 HLA-B4402 VTGYKVQYTSL VTGYQYTSL 0 0 0 4 2 VTGYKVQYTSL NM_000094_3_COL 0.020 40061.36 75.00 NA
      2 HLA-B4402 TGYKVQYTSLT TGYKVYTSL 0 0 0 5 1 TGYKVQYTSL NM_000094_3_COL 0.020 40487.08 75.00 NA
      3 HLA-B4402 GYKVQYTSLTG YVQYTSLTG 1 0 0 1 1 YKVQYTSLTG NM_000094_3_COL 0.017 41521.20 80.00 NA
      4 HLA-B4402 YKVQYTSLTGL YQYTSLTGL 0 0 0 1 2 YKVQYTSLTGL NM_000094_3_COL 0.031 35710.76 49.00 NA
      5 HLA-B4402 KVQYTSLTGLG KVQYTSLTL 0 0 0 8 1 KVQYTSLTGL NM_000094_3_COL 0.029 36392.20 55.00 NA
      6 HLA-B4402 VQYTSLTGLGQ VQYTSLTGL 0 0 0 0 0 VQYTSLTGL NM_000094_3_COL 0.016 42180.50 85.00 NA
      7 HLA-B4402 QYTSLTGLGQP QYTSLTGLG 0 0 0 0 0 QYTSLTGLG NM_000094_3_COL 0.011 44293.17 95.00 NA
      8 HLA-B4402 YTSLTGLGQPL YTSLLGQPL 0 0 0 4 2 YTSLTGLGQPL NM_000094_3_COL 0.034 34547.04 44.00 NA
      9 HLA-B4402 TSLTGLGQPLP SLTGLGQPL 1 0 0 0 0 SLTGLGQPL NM_000094_3_COL 0.024 38475.10 65.00 NA
      10 HLA-B4402 SLTGLGQPLPS SLTGLGQPL 0 0 0 0 0 SLTGLGQPL NM_000094_3_COL 0.026 37575.76 60.00 NA
      11 HLA-B4402 LTGLGQPLPSX LLGQPLPSX 0 0 0 1 2 LTGLGQPLPSX NM_000094_3_COL 0.014 42874.84 90.00 NA
      0 HLA-B4402 FLRLLDLAQEE RLLDLAQEE 2 0 0 0 0 RLLDLAQEE NM_000106_5_CYP 0.014 42841.45 90.00 NA
      1 HLA-B4402 LRLLDLAQEEL RLLDLAQEL 1 0 0 7 1 RLLDLAQEEL NM_000106_5_CYP 0.029 36648.25 55.00 NA
      2 HLA-B4402 RLLDLAQEELK RLLDLAQEL 0 0 0 7 1 RLLDLAQEEL NM_000106_5_CYP 0.029 36350.87 55.00 NA
      3 HLA-B4402 LLDLAQEELKE LLDLAQEEL 0 0 0 0 0 LLDLAQEEL NM_000106_5_CYP 0.013 43487.79 95.00 NA
      4 HLA-B4402 LDLAQEELKEE LDQEELKEE 0 0 0 2 2 LDLAQEELKEE NM_000106_5_CYP 0.008 45629.40 99.00 NA
      5 HLA-B4402 DLAQEELKEES AQEELKEES 2 0 0 0 0 AQEELKEES NM_000106_5_CYP 0.009 45287.57 99.00 NA
      6 HLA-B4402 LAQEELKEESG AEELKEESG 1 0 0 1 1 AQEELKEESG NM_000106_5_CYP 0.013 43568.32 95.00 NA
      7 HLA-B4402 AQEELKEESGF AELKEESGF 0 0 0 1 2 AQEELKEESGF NM_000106_5_CYP 0.231 4113.65 2.50 NA
      8 HLA-B4402 QEELKEESGFL QELKEESGF 0 0 0 1 1 QEELKEESGF NM_000106_5_CYP 0.123 13202.71 6.00 NA
      9 HLA-B4402 EELKEESGFLR EELKEESGF 0 0 0 0 0 EELKEESGF NM_000106_5_CYP 0.076 21904.46 13.00 NA
      10 HLA-B4402 ELKEESGFLRE ELKEESGFL 0 0 0 0 0 ELKEESGFL NM_000106_5_CYP 0.030 36301.74 55.00 NA
      11 HLA-B4402 LKEESGFLREX KEESFLREX 1 0 0 4 1 KEESGFLREX NM_000106_5_CYP 0.060 26205.35 19.00 NA
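To cover the second part of the question (all tables from all files), the same idea can be wrapped in a function and applied to every file. This is a sketch in base R under some assumptions: `parse_file` is a hypothetical helper, the demo files are written to a temporary directory only so the example runs on its own, and BindLevel is assumed empty in every row, as in the sample.

```r
# Parse one file: keep only the data rows and attach the header names.
# parse_file is a hypothetical helper, not part of any package.
parse_file <- function(path) {
  lines <- readLines(path, warn = FALSE)
  # First header line: optional whitespace, then "pos"
  header_line <- grep("^\\s*pos\\s", lines, value = TRUE)[1]
  header <- strsplit(trimws(header_line), "\\s+")[[1]]
  # Data rows are the only lines that start with a digit
  data_lines <- grep("^\\s*\\d", lines, value = TRUE)
  df <- read.table(text = data_lines, stringsAsFactors = FALSE)
  # BindLevel is assumed empty, so the rows have one field fewer than the header
  names(df) <- header[seq_len(ncol(df))]
  df
}

# Demo input: two small files in a temporary directory (illustration only)
sample_text <- c(
  "# Rank Threshold for Strong binding peptides   0.500",
  "-----------------------------------------------------------------------------------",
  "  pos          HLA      peptide         Core Offset  I_pos  I_len  D_pos  D_len        iCore        Identity 1-log50k(aff) Affinity(nM)    %Rank  BindLevel",
  "-----------------------------------------------------------------------------------",
  "    0    HLA-B4402  GSHDLGIILQK    GSHDLGIIL      0      0      0      0      0    GSHDLGIIL NM_000094_3_COL         0.015     42580.79    90.00",
  "    1    HLA-B4402  SHDLGIILQKI    SLGIILQKI      0      0      0      1      2  SHDLGIILQKI NM_000094_3_COL         0.024     38731.55    65.00",
  "-----------------------------------------------------------------------------------",
  "",
  "Protein NM_000094_3_COL. Allele HLA-B4402. Number of high binders 0. Number of weak binders 0. Number of peptides 12"
)
demo_dir <- tempfile("tables_")
dir.create(demo_dir)
writeLines(sample_text, file.path(demo_dir, "a.txt"))
writeLines(sample_text, file.path(demo_dir, "b.txt"))

# Parse every file and stack the results into one table
file_names <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)
all_tables <- do.call(rbind, lapply(file_names, parse_file))

# Write the merged table tab-separated
write.table(all_tables, file.path(demo_dir, "merged.tsv"),
            sep = "\t", quote = FALSE, row.names = FALSE)
```

For the real data, `file_names` would point at the directory holding the 200+ files instead of `demo_dir`.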

      【Comments】:
