Matching and merging two text tables? Answers

[Question title]: Matching and merging two text tables?
[Posted]: 2014-02-11 12:09:09
[Question description]:

I have 2 (fairly large, ~15k rows) csv tables in the following formats:

Disease/Trait                Mapped_gene    p-Value 
Wegener's granulomatosis    HLA-DPB1        2.00E-50    
Wegener's granulomatosis    TENM3 - DCTD    2.00E-06    
Brugada syndrome            SCN5A           1.00E-14    
Brugada syndrome            SCN10A          1.00E-68    
Brugada syndrome            HEY2 - NCOA7    5.00E-17    
Major depressive disorder   IRF8 - FENDRR   3.00E-07    


Identifier  Homologues  Symbol
CG11621     5286    HEY2
CG11621     5287    IRF8
CG11621     5287    PIK3C2B
CG11621     5288    PIK3C2G
CG11621     5288    PIK3C2G
CG11949     2035    DCTD
CG11949     2035    EPB41
CG11949     2036    EPB41L1
CG11949     2037    EPB41L2

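I want to use Python to compare the tables so that, whenever a value in the "Symbol" column of table 2 matches a "Mapped_gene" in table 1, the matching rows from both tables are merged together and written to an output file.

I have tried using the Pandas library but could not get it to work. Does anyone have a better idea?

Thanks.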

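[Question comments]:

How big are the csv files? Just wondering whether this can be done in memory or whether I need a database.
Can each mapped gene have more than one disease? (i.e. two or more rows in the first table having the same value in the Mapped_gene column)

Tags: python python-2.7 csv genetics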

[Solution 1]:

This should work the way you want:

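import csv

diseases = {}

# Python 2.7 (as tagged): csv files are opened in binary mode ('rb'/'wb').
# On Python 3, use open(path, newline='') and open(path, 'w', newline='') instead.

# Load the disease file in memory, keyed by gene name.
# csv.reader is not a context manager, so the file object goes in the with statement.
with open('table1.csv', 'rb') as dfile:
    reader = csv.reader(dfile)
    # Skip the header
    next(reader)
    for disease, gene, pvalue in reader:
        diseases[gene] = (disease, pvalue)

with open('table2.csv', 'rb') as idfile, open('output.csv', 'wb') as outfile:
    reader = csv.reader(idfile)
    writer = csv.writer(outfile)
    # Skip the header
    next(reader)
    for ident, homologue, symbol in reader:
        # Exact match only: "HEY2" does not match a "HEY2 - NCOA7" entry
        if symbol in diseases:
            writer.writerow((ident, homologue, symbol) + diseases[symbol])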

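It assumes that every gene name under Mapped_gene is unique; if a name appears more than once, only the last row for it is kept. If that is not the case, it can easily be extended to handle duplicates.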
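
One possible way to handle duplicates is to store a list of rows per gene instead of a single tuple. The sketch below is written for Python 3 and uses small in-memory samples based on the question's data. The helper names `index_diseases` and `merge_rows` are made up for illustration, and it additionally assumes that an entry such as "HEY2 - NCOA7" lists two genes separated by " - ", which the question does not confirm.

```python
import csv
import io
from collections import defaultdict


def index_diseases(rows):
    """Map each gene symbol to every (disease, pvalue) row that mentions it."""
    index = defaultdict(list)
    for disease, mapped_gene, pvalue in rows:
        # Assumption: "HEY2 - NCOA7" names two genes separated by " - ".
        # Names with a plain hyphen, such as "HLA-DPB1", are left intact.
        for gene in mapped_gene.split(" - "):
            index[gene.strip()].append((disease, pvalue))
    return index


def merge_rows(id_rows, index):
    """Yield one merged row per (table 2 row, matching disease row) pair."""
    for ident, homologue, symbol in id_rows:
        for disease, pvalue in index.get(symbol, ()):
            yield (ident, homologue, symbol, disease, pvalue)


# Sample rows from the question, written out as comma-separated text.
table1 = io.StringIO(
    "Disease/Trait,Mapped_gene,p-Value\n"
    "Brugada syndrome,SCN5A,1.00E-14\n"
    "Brugada syndrome,HEY2 - NCOA7,5.00E-17\n"
    "Major depressive disorder,IRF8 - FENDRR,3.00E-07\n"
)
table2 = io.StringIO(
    "Identifier,Homologues,Symbol\n"
    "CG11621,5286,HEY2\n"
    "CG11621,5287,IRF8\n"
    "CG11621,5287,PIK3C2B\n"
)

reader1 = csv.reader(table1)
next(reader1)  # skip the header
index = index_diseases(reader1)

reader2 = csv.reader(table2)
next(reader2)  # skip the header
merged = list(merge_rows(reader2, index))

for row in merged:
    print(row)
```

With this layout a gene that appears under several diseases produces one output row per disease, rather than silently keeping only the last one.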

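[Comments]:

Perfect! Thanks a lot.
Note that it loads the whole disease file into memory first. 15,000 rows should not be a problem (probably under 1 MB), but keep this in mind if you ever want to process something (much) bigger.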
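
If the tables ever outgrow memory, one option (not part of the answer above) is to let a database do the join, as hinted at in the question comments. The following is only a sketch for Python 3 using the standard `sqlite3` module. The function name `join_with_sqlite`, the table and column names, and the sample identifier row are all made up for illustration. Passing a file path as `db_path` keeps the data on disk instead of in memory.

```python
import csv
import io
import sqlite3


def join_with_sqlite(table1_file, table2_file, db_path=":memory:"):
    """Load both CSV tables into SQLite and join them on the gene symbol."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE diseases (disease TEXT, gene TEXT, pvalue TEXT)")
    conn.execute("CREATE TABLE ids (ident TEXT, homologue TEXT, symbol TEXT)")

    reader = csv.reader(table1_file)
    next(reader)  # skip the header
    conn.executemany("INSERT INTO diseases VALUES (?, ?, ?)", reader)

    reader = csv.reader(table2_file)
    next(reader)  # skip the header
    conn.executemany("INSERT INTO ids VALUES (?, ?, ?)", reader)

    # An index on the join column keeps lookups fast on large tables.
    conn.execute("CREATE INDEX idx_gene ON diseases (gene)")
    # fetchall() is fine for a sketch; for very large results,
    # iterate over the cursor and write rows out one at a time.
    rows = conn.execute(
        "SELECT ident, homologue, symbol, disease, pvalue "
        "FROM ids JOIN diseases ON ids.symbol = diseases.gene "
        "ORDER BY ids.rowid"
    ).fetchall()
    conn.close()
    return rows


# The disease row is from the question; the identifier rows are hypothetical.
table1 = io.StringIO(
    "Disease/Trait,Mapped_gene,p-Value\n"
    "Brugada syndrome,SCN5A,1.00E-14\n"
)
table2 = io.StringIO(
    "Identifier,Homologues,Symbol\n"
    "CG00000,0000,SCN5A\n"
    "CG00000,0001,EPB41\n"
)

rows = join_with_sqlite(table1, table2)
for row in rows:
    print(row)
```

Like the original answer, this matches on exact equality only, so compound entries such as "HEY2 - NCOA7" would need to be split before loading.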