How can I find matching DNA sequences in two files, using Python?

【Question title】: How can I find matching DNA sequences in two files, using python?
【Posted】: 2020-11-14 22:54:18
【Question】:

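I have been working on this for a long time, but I can't seem to get it right. I have tried everything I know, and I really need help figuring out what is going wrong.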

I have two files. File 1 is a list of names and sequence counts, i.e.:

name,AGAT,AATG,TATC
Jake,28,42,14
Chris,17,22,19
Anne,36,18,25

And file 2 is a string of DNA: "GCTAAATTTGTTCAGCCAGATGTAGGCTTACAAATCAAGCTGTCCGCTCGGCACGGCCTACACACGT..."

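The idea is to implement a program that identifies a person from their DNA: go through file 2 and count the occurrences of each sequence listed in file 1. If the counts match those in a row of file 1, return that row's name. Unfortunately, I can't seem to get the correct "totals" for the second file.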

Here is what I have so far:

Python:

import collections
import csv
import re
from collections import Counter
from sys import argv

with open(argv[1], 'r') as csvfile:
    csvfile_data = csv.reader(csvfile)
    next(csvfile_data)          #skip first line
    for row in csvfile_data:
        list_temp = row
        
        # copy elements into a new list
        temp = []
        temp.extend(list_temp)
        
        # remove the first element, because it's the name
        name = temp.pop(0)

        # the values attached to the name
        csvlist = temp
        #change strings in list to integers
        csvlist = [int(i) for i in csvlist]

# open dna sequence also
with open(argv[2], 'r') as dnafile:
    dnafile_data = dnafile.read()
    
    # use a regular expression to find each sequence's occurrences in the file
    patterns = re.compile(r'AGATC|TTTTTTCT|AATG|TCTAG|GATA|TATC|GAAA|TCTG')
    result = re.findall(patterns, dnafile_data) 
    
    # count each sequence's occurrences
    dictionary = Counter(result)
    
    # collect the counts (the dictionary's values)
    dnalist = dictionary.values()
    print(dnalist)
    
if collections.Counter(csvlist) == collections.Counter(dnalist):
    print(name)
else:
    print("No match")

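【Comments on the question】:

I think this needs a (short) example of what the input data looks like before anyone can start answering. It's also not clear to me what you expect to happen with overlapping sequences.
@DavidW Thanks for the reply, I'm a bit new to this. I'll edit the question and add an example of the input data.
No idea, I'm afraid. The second half looks like it should work (assuming the sequences don't overlap). csvlist only ever holds the last row of your first file.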
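
The point in the last comment (only the last CSV row survives the loop) can be sketched as follows. This is a minimal sketch, not the asker's code: `find_match` is a hypothetical helper, the CSV is passed in as text so the example is self-contained, and the toy data is made up. It assumes the header row of file 1 names the sequences to count and that plain non-overlapping counts are what is wanted.

```python
import csv
import io

def find_match(csv_text, dna):
    """Return the first name whose counts all equal the counts found in dna, else None."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Every column except "name" is a sequence to count.
    patterns = [field for field in reader.fieldnames if field != "name"]
    # str.count counts non-overlapping occurrences.
    found = {p: dna.count(p) for p in patterns}
    # Compare inside the loop, so every row is checked, not just the last one.
    for row in reader:
        if all(int(row[p]) == found[p] for p in patterns):
            return row["name"]
    return None

# Hypothetical toy data, not taken from the question's files.
csv_text = "name,AGAT,AATG,TATC\nJake,2,1,0\nChris,1,1,1\n"
dna = "AGATAATGTATC"
print(find_match(csv_text, dna) or "No match")  # prints "Chris"
```

With real files, the same function could be fed `open(argv[1]).read()` and `open(argv[2]).read().strip()`.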

Tags: python csv dna-sequence

【Solution 1】:

You can use the logic behind a simple product recommendation engine, for example:

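import pandas as pd

def sequence(string):
    # count the (non-overlapping) occurrences of each sequence
    count_AGAT = string.count('AGAT')
    count_AATG = string.count('AATG')
    count_TATC = string.count('TATC')
    print(count_AGAT)
    print(count_AATG)
    print(count_TATC)

    # one-row dataframe with the same columns as file 1
    data_dna = {'Name': ['01'],
                'AGAT': [count_AGAT],
                'AATG': [count_AATG],
                'TATC': [count_TATC]}
    df_dna = pd.DataFrame(data_dna)
    print(df_dna)
    # return the row so it can be used outside the function
    return df_dna

df_dna = sequence('TCATCTAGGAGGCGCGCGTAGGATAAATAATTCAATTAAGATGTCGTTTTGC...')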

You get a dataframe output such as:

40
31
42
  Name  AGAT  AATG  TATC
0   01    40    31    42
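
One caveat on the counting step: `str.count` only counts non-overlapping occurrences, which is the concern raised in the question comments. If overlapping hits should count as well, a zero-width lookahead is one way to do it. This is only a sketch; `count_overlapping` is a hypothetical helper name and the input string is a toy example.

```python
import re

def count_overlapping(pattern, text):
    # A zero-width lookahead matches at every start position without
    # consuming characters, so overlapping hits are all counted.
    return len(re.findall(f"(?={re.escape(pattern)})", text))

text = "TATATA"
print(text.count("TATA"))               # 1: the scan resumes after the first hit
print(count_overlapping("TATA", text))  # 2: hits at index 0 and index 2
```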

Then append the new row to the dataframe you already have:

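# df holds the contents of file 1 (the Jake/Chris/Anne rows shown below)
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([df, df_dna], ignore_index=True)
print(df)
df = df.drop(columns='Name')
print(df)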

The output will be:

    Name  AGAT  AATG  TATC
0   Jake    28    42    14
1  Chris    17    22    19
2   Anne    36    18    25
3     01    40    31    42
   AGAT  AATG  TATC
0    28    42    14
1    17    22    19
2    36    18    25
3    40    31    42

Save the rows into separate variables:

df_jake = df.iloc[0]
df_chris = df.iloc[1]
df_anne = df.iloc[2]
df_sequence = df.iloc[3]
print(df_jake)

Which gives the output:

AGAT    28
AATG    42
TATC    14
Name: 0, dtype: int64

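Then compare the rows using spatial.distance.euclidean values, as you would when building a recommendation engine with collaborative filtering (help: https://realpython.com/build-recommendation-engine-collaborative-filtering/):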

from scipy import spatial
diff_jake = spatial.distance.euclidean(df_sequence, df_jake)
diff_chris = spatial.distance.euclidean(df_sequence, df_chris)
diff_anne = spatial.distance.euclidean(df_sequence, df_anne)
print('Jake: ', diff_jake)
print('Chris: ', diff_chris)
print('Anne: ', diff_anne)

In this example, the output you get would be:

Jake:  32.38826948140329
Chris:  33.74907406137241
Anne:  21.77154105707724

So the provided DNA sequence is probably most similar to Anne's.
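
The same comparison can be written without one variable per person. The sketch below uses the counts from the dataframes above and swaps in the standard library's `math.dist` (Python 3.8+) for `spatial.distance.euclidean`; the names `profiles` and `sample` are made up for illustration.

```python
import math

# Counts taken from the dataframes shown above.
profiles = {
    "Jake": (28, 42, 14),
    "Chris": (17, 22, 19),
    "Anne": (36, 18, 25),
}
sample = (40, 31, 42)

# math.dist returns the Euclidean distance between two points.
distances = {name: math.dist(sample, counts) for name, counts in profiles.items()}
closest = min(distances, key=distances.get)
print(closest, distances[closest])
```

A distance of exactly 0 means every count is identical, which is the exact match the question asks for; anything larger is only a nearest neighbour.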

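You can use scipy.spatial.distance.euclidean to calculate the distance between two points. Using it to calculate the distance between the ratings of A, B, and D and the ratings of C shows that, in terms of distance, C's ratings are closest to B's: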

>>> spatial.distance.euclidean(c, a)
2.5
>>> spatial.distance.euclidean(c, b)
0.5
>>> spatial.distance.euclidean(c, d)
2.23606797749979
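
The vectors `a`, `b`, `c` and `d` are not defined anywhere in this answer. The values below are an assumption: they are one set of 2-D points that reproduces the three distances shown above, and `math.dist` again stands in for `spatial.distance.euclidean`.

```python
import math

# Assumed values, not given in this answer; chosen because they
# reproduce the three distances shown above.
a = [1, 2]
b = [2, 4]
c = [2.5, 4]
d = [4.5, 5]

for label, other in (("a", a), ("b", b), ("d", d)):
    print(label, math.dist(c, other))
```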

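You can also use the cosine distance between vectors.

To calculate similarity using the angle, you need a function that returns a higher similarity (or smaller distance) for a smaller angle and a lower similarity (or larger distance) for a larger angle. The cosine of an angle is such a function: it decreases from 1 to -1 as the angle increases from 0 to 180 degrees.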
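
A minimal pure-Python sketch of that idea follows. `cosine_similarity` is a hypothetical helper written for illustration, and the three calls just show the 1, 0 and -1 cases at 0, 90 and 180 degrees.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity((1, 0), (2, 0)))   # 1.0, same direction (0 degrees)
print(cosine_similarity((1, 0), (0, 3)))   # 0.0, perpendicular (90 degrees)
print(cosine_similarity((1, 0), (-4, 0)))  # -1.0, opposite direction (180 degrees)
```

`scipy.spatial.distance.cosine` returns the cosine distance, which is 1 minus this similarity. Note that the angle ignores vector length, so counts of (2, 2, 2) and (4, 4, 4) look identical to it; for matching exact counts, the Euclidean distance is the stricter test.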

【Comments】: