Counting items in a txt file with Python dictionaries

【Question title】: Counting items in txt file with Python dictionaries
【Posted】: 2021-11-15 15:44:47
【Question description】:

I have the following txt file (only a snippet is shown):

    ## DISTANCE : Shortest distance from variant to transcript
    ## a lot of comments here
    ## STRAND : Strand of the feature (1/-1)
    ## FLAGS : Transcript quality flags
    #Uploaded_variation     Location        Allele  Gene    Feature Feature_type    Consequence     cDNA_position   CDS_position    Protein_position        Amino_acids     Codons  Existing_variation      Extra
    chr1_69270_A/G  chr1:69270      G       ENSG00000186092 ENST00000335137 Transcript      upstream_gene_variant      216     180     60      S       tcA/tcG -       IMPACT=LOW;STRAND=1
    chr1_69270_A/G  chr1:69270      G       ENSG00000186092 ENST00000641515 Transcript      intron_variant      303     243     81      S       tcA/tcG -       IMPACT=LOW;STRAND=1
    chr1_69511_A/G  chr1:69511      G       ENSG00000186092 ENST00000335137 Transcript      upstream_gene_variant        457     421     141     T/A     Aca/Gca -       IMPACT=MODERATE;STRAND=1

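There are many different ENSG IDs that are not known in advance, e.g. ENSG00000187583. Each ENSG string contains 11 digits.

I have to count how many intron_variant and upstream_gene_variant entries each gene (ENSGxxx) contains, and write the result to a csv file.

I am using a dictionary for this. The logic should be: if the 11 digits are not yet in the dictionary, add them with the value 1; if they are already in the dictionary, change the value to x + 1. This is the code I currently have, but I am not really a Python programmer and I am not sure the syntax is correct.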

    with open(file, 'rt') as f:
        data = f.readlines()
        Count = 0
        d = {}
        for line in data:
            if line[0] == "#":
                output.write(line)
            if line.__contains__('ENSG'): 
                d[line.split('ENSG')[1][0:11]]=1
                if 1 in d:
                    d=1
                else:
                    Count += 1

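Any suggestions?

Thanks!

【Question discussion】:

What is your specific question about this code? The Python interpreter will tell you whether the syntax is correct. Does it do what you want? If not, what does it do wrong?
It only works for the example you provided. 4 {'00000187961': 1, '00000187583': 1} Count = 4. Is that OK?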

Tags: python dictionary count bioinformatics contains

【Solution 1】:

You can try this:

from collections import Counter

with open('data.txt') as fp:
    ensg = []
    for line in fp:
        idx = line.find('ENSG')
        if not line.startswith('#') and idx != -1:
            ensg.append(line[idx+4:idx+15])
count = Counter(ensg)

>>> count
Counter({'00000187961': 2, '00000187583': 2})
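
For reference, the logic described in the question (add the key with the value 1 if it is missing, otherwise increment it) can also be written with a plain dictionary. This is only a minimal sketch: `count_ensg` is a hypothetical helper name, and `sample_lines` is made-up data shortened from the question's snippet, not the real file.

```python
def count_ensg(lines):
    """Count the 11-digit ENSG IDs found on non-comment lines."""
    counts = {}
    for line in lines:
        if line.lstrip().startswith('#'):
            continue  # skip header and comment lines
        idx = line.find('ENSG')
        if idx == -1:
            continue  # no gene ID on this line
        gene = line[idx + 4:idx + 15]  # the 11 digits after 'ENSG'
        # dict.get returns 0 for a missing key, so one line covers both cases
        counts[gene] = counts.get(gene, 0) + 1
    return counts


# Illustrative lines only (assumption: shortened from the question's sample)
sample_lines = [
    "## STRAND : Strand of the feature (1/-1)",
    "chr1_69270_A/G\tchr1:69270\tG\tENSG00000186092\tENST00000335137",
    "chr1_69270_A/G\tchr1:69270\tG\tENSG00000186092\tENST00000641515",
    "chr1_69511_A/G\tchr1:69511\tG\tENSG00000187583\tENST00000335137",
]
print(count_ensg(sample_lines))  # {'00000186092': 2, '00000187583': 1}
```

`Counter` does the same bookkeeping internally; the explicit version is shown only because it mirrors the "add 1 or increment" rule stated in the question.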

Update

I need to know how many ENSG contain "intron_variant" and "upstream_gene_variant"

Use a regular expression to extract the required patterns:

from collections import Counter
import re

PAT_ENSG = r'ENSG(?P<ensg>\d{11})'
PAT_VARIANT = r'(?P<variant>intron_variant|upstream_gene_variant)'

PATTERN = re.compile(fr'{PAT_ENSG}.*\b{PAT_VARIANT}\b')

with open('data.txt') as fp:
    ensg = []
    for line in fp:
        sre = PATTERN.search(line)
        if not line.startswith('#') and sre:
            ensg.append(sre.groups())
    count = Counter(ensg)

Output:

>>> count
Counter({('00000186092', 'upstream_gene_variant'): 2,
         ('00000186092', 'intron_variant'): 1})
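
The question also asks for csv output, which this answer does not show. Below is a minimal sketch, assuming `count` has the shape printed above (a `Counter` keyed by `(ensg, variant)` tuples). The column names and the `write_counts` helper are my own choice and do not come from the original post; an in-memory buffer stands in for a real output file.

```python
import csv
import io
from collections import Counter


def write_counts(count, out):
    """Write one row per (gene, variant) pair to a file-like object."""
    writer = csv.writer(out, lineterminator='\n')
    writer.writerow(['gene', 'variant', 'count'])  # assumed column names
    for (ensg, variant), n in sorted(count.items()):
        # Put the 'ENSG' prefix back, since the regex captured only the digits
        writer.writerow([f'ENSG{ensg}', variant, n])


# Same shape as the Counter printed above
count = Counter({('00000186092', 'upstream_gene_variant'): 2,
                 ('00000186092', 'intron_variant'): 1})

buffer = io.StringIO()  # stands in for open('counts.csv', 'w', newline='')
write_counts(count, buffer)
print(buffer.getvalue())
```

Sorting the items keeps the rows for one gene together, which is convenient when the file is opened in a spreadsheet.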

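【Discussion】:

Thanks! But now it outputs the counts for all ENSG IDs, whereas I need to know how many ENSG contain "intron_variant" and "upstream_gene_variant".
I suggest you post the expected result for your sample. That would make it easier for me.
Thanks! The result I actually want is: how many upstream_gene_variant or intron_variant does each gene have?
@user15480777, I have updated my answer. Could you check it, please?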

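【Solution 2】:

Here is another interpretation of what you asked for:

I have modified your sample data so that the first ENSG value is ENSG00000187971, to highlight how this works.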

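D = {}  # consequence -> {gene ID -> count}

with open('eng.txt') as eng:
    for line in eng:
        # skip header and comment lines
        if not line.startswith('#'):
            t = line.split()
            V = t[6]  # Consequence column
            E = t[3]  # Gene column
            if V not in D:
                D[V] = {}
            if E not in D[V]:
                D[V][E] = 1
            else:
                D[V][E] += 1
print(D)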

The output of this is:

{'intron_variant': {'ENSG00000187971': 1, 'ENSG00000187961': 1}, 'upstream_gene_variant': {'ENSG00000187583': 2}}

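So what you now have is a dictionary keyed by variant. Each variant has its own dictionary, keyed by ENSG value, that holds the number of occurrences of each ENSG value.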
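
To make the nested structure concrete, here is a short sketch that reads counts back out of a dictionary of that shape. The literal `D` is copied from the output above; `counts_for_gene` is a hypothetical helper added only for illustration.

```python
# Nested dictionary copied from the output shown above
D = {
    'intron_variant': {'ENSG00000187971': 1, 'ENSG00000187961': 1},
    'upstream_gene_variant': {'ENSG00000187583': 2},
}


def counts_for_gene(nested, gene):
    """Return {variant: count} for one gene, skipping variants it lacks."""
    return {variant: genes[gene]
            for variant, genes in nested.items()
            if gene in genes}


# Outer key is the variant, inner key is the gene ID
print(D['upstream_gene_variant']['ENSG00000187583'])  # 2
print(counts_for_gene(D, 'ENSG00000187961'))  # {'intron_variant': 1}
```

Because the outer key is the variant, a per-gene view needs a pass over all variants, as `counts_for_gene` does. Swapping the nesting order would make per-gene lookups direct instead.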

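【Discussion】:

Hello, I tried this but now get the following: "IndexError: list index out of range". I am using Python 3.
Your data must differ in content or structure from the sample you originally posted.
Hello, I have updated the sample. The header is different.
This will work with your data. You have just added comment lines to the sample, which this code accounts for.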
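
The cause of the `IndexError` is not established in this thread. One possibility is a line that does not start with `#` in column 0 (a blank line, or a comment indented with spaces) and that has fewer than seven whitespace-separated fields. The sketch below skips such lines. `tally`, `MIN_FIELDS` and the sample text are illustrative assumptions; the column positions 3 and 6 are taken from the answer above.

```python
import io
from collections import Counter, defaultdict

MIN_FIELDS = 7  # index 6 (Consequence) must exist


def tally(lines):
    """Map each consequence to a Counter of gene IDs, skipping bad lines."""
    result = defaultdict(Counter)
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith('#'):
            continue  # blank line, or comment (even if indented)
        fields = stripped.split()
        if len(fields) < MIN_FIELDS:
            continue  # too short to hold both columns
        gene, consequence = fields[3], fields[6]
        result[consequence][gene] += 1
    return result


# Made-up sample: an indented comment, a blank line, three data rows
sample = io.StringIO(
    "    ## FLAGS : Transcript quality flags\n"
    "\n"
    "chr1_69270_A/G chr1:69270 G ENSG00000186092 ENST00000335137 Transcript upstream_gene_variant\n"
    "chr1_69270_A/G chr1:69270 G ENSG00000186092 ENST00000641515 Transcript intron_variant\n"
    "chr1_69511_A/G chr1:69511 G ENSG00000186092 ENST00000335137 Transcript upstream_gene_variant\n"
)
counts = tally(sample)
print({k: dict(v) for k, v in counts.items()})
```

Silently skipping short lines hides malformed input, so for real data it may be better to log or count the skipped lines instead.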