Extracting a set of lines from a text file

【Question title】: Extracting a set of lines from a text file
【Posted】: 2020-12-06 19:06:59
【Question】:

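I have a set of text files, for example https://www.uniprot.org/uniprot/A0R4Q6.txt

I am trying to write a function that takes a UniProt ID as input and outputs a dataframe in the following format (ideally one I can use as input to scikit-learn?) (comma-separated here only for clarity):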

UniProt-ID,Position,AA   
A0R4Q6,1,M
A0R4Q6,2,T
A0R4Q6,3,Q

This is what I am working with so far:

def get_features(ID):
    featureList=[]
    #set and open link to uniprot website
    link="https://www.uniprot.org/uniprot/{}.txt".format(ID) 
    file = urllib.request.urlopen(link)
    #find amino acid sequence
    for line in file:
        nextLine = next(file)
        #print(nextLine)
        if b'SQ' in line:
            print(line)
            #unsure how to extract more than 1 line
            #additionally, the number of lines that
            #I will need will be variable, depending on the protein length
            
            #this is what I think the extracted lines put into a string will look like
            aaSeq='MTQMLTRPDV\tDLVNGMFYAD\tGGAREAYRWM\tRANEPVFRDR\tNGLAAATTYQ\tAVLDAERNPE\nLFSSTGGIRP\tDQPGMPYMID'
            #remove \t and \n characters
            ActualSeq=re.sub('\s+', '', aaSeq)
            print(ActualSeq)
    #now iterate through the string to create dataframe?
    p=1
    for i in ActualSeq:
        featureList.append([ID,p,i])
        p+=1
    return featureList
seq=get_features('A0R4Q6')
print(seq)

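I have two problems:

Searching for b'SQ' returns nothing, although the same syntax works if I search for b'ID' or b'FT', etc. Any idea why it fails to recognize 'SQ'?
I don't know how to make this for loop return all the lines after the 'SQ' line, up to the last line containing '//', and compress them into a single string.

Also, is building the "dataframe" as a list of tuples the most efficient approach, or should I be doing something completely different? The end goal is to use this dataframe as input to a scikit-learn random forest.

TIA!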

【Comments】:

You are skipping half of the data with the line nextLine = next(file), because that call advances the file iterator inside the loop.
@fdermishin, thanks, fixed!
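
The skipping described in the comment above can be reproduced without any network access. The sketch below is only an illustration: `sample` is a made-up four-line stand-in for the HTTP response, not real UniProt data. It shows that calling `next()` on the same iterator inside the `for` loop hides every second line from the loop body, which is one way a line such as the `SQ` header could be missed.

```python
import io

# Hypothetical stand-in for the HTTP response: four short byte lines.
sample = io.BytesIO(b"ID\nSQ\n     MTQ\n//\n")

seen = []
for line in sample:
    seen.append(line.strip())
    # This consumes the following line, so the loop body never sees it.
    next(sample, None)

print(seen)
```

Here the loop body only ever sees the first and third lines; the `SQ` line and the `//` line are consumed by `next()`.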

Tags: python dataframe python-re

【Solution 1】:

To get the exact output you asked for, try the following:

import urllib.request


def get_features(ID):
    featureList = []

    # Build the UniProt URL for this ID and open it
    link = "https://www.uniprot.org/uniprot/{}.txt".format(ID)
    file = urllib.request.urlopen(link)

    found_seq = False
    full_seq = ''

    # Find the amino acid sequence: it follows the 'SQ' header line
    for line in file:
        if line.startswith(b'SQ   '):
            found_seq = True
        elif found_seq and line.startswith(b'     '):
            # Sequence lines are indented; drop all whitespace and append
            line = ''.join(line.decode("utf-8").split())
            full_seq += line
        else:
            found_seq = False

    # Number the residues starting from 1
    for i, a in enumerate(full_seq):
        featureList.append([ID, i+1, a])
    return featureList


seq = get_features('A0R4Q6')

for item in seq:
    print(item)

It will print the following:

['A0R4Q6', 1, 'M']
['A0R4Q6', 2, 'T']
['A0R4Q6', 3, 'Q']
['A0R4Q6', 4, 'M']
['A0R4Q6', 5, 'L']
...
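
For testing without a network connection, the same idea can be split into a parser that works on any iterable of text lines. This is a minimal sketch, not the answerer's code: `extract_sequence` and `to_rows` are hypothetical helper names, and `sample` is a made-up, shortened record whose residues are copied from the string in the question. It assumes the sequence lines sit between the `SQ` header and the `//` terminator, as the question describes.

```python
def extract_sequence(lines):
    """Join the residues found between the 'SQ' header and the '//' line."""
    in_sequence = False
    chunks = []
    for line in lines:
        if line.startswith("SQ"):
            in_sequence = True
        elif line.startswith("//"):
            break  # end of the record
        elif in_sequence:
            # Drop the spaces that separate the blocks of residues.
            chunks.append("".join(line.split()))
    return "".join(chunks)


def to_rows(uniprot_id, sequence):
    """Build [ID, 1-based position, residue] rows."""
    return [[uniprot_id, pos, aa] for pos, aa in enumerate(sequence, start=1)]


# Made-up, shortened record for illustration only.
sample = """ID   EXAMPLE
FT   CHAIN  1..20
SQ   SEQUENCE   20 AA;
     MTQMLTRPDV DLVNGMFYAD
//
""".splitlines()

sequence = extract_sequence(sample)
rows = to_rows("A0R4Q6", sequence)
print(rows[:3])
```

If pandas is available, a list of rows like this can be passed to `pandas.DataFrame(rows, columns=["UniProt-ID", "Position", "AA"])` to get the dataframe layout shown in the question.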
