Assuming there is a typo in your expected output and it should actually be
/52MOErT/*/i2MOErG/*/i2MOErT/*A*C*G*T*T*G*C*T*C*C*/i2MOErG/*/i2MOErA/*/32MOErC/
the following code will work:
# python3
def encode_sequence(seq):
    # Split the sequence into the first three bases, the last three, and the rest.
    seq_front = seq[:3]
    seq_back = seq[-3:]
    seq_middle = seq[3:-3]
    # Templates for the modified bases; {} is replaced by the base letter.
    front_ix = ["/52MOEr{}/", "/i2MOEr{}/", "/i2MOEr{}/"]
    back_ix = ["/i2MOEr{}/", "/i2MOEr{}/", "/32MOEr{}/"]
    encoded = []
    for base, index in zip(seq_front, front_ix):
        encoded.append(index.format(base))
    # Middle bases are added unchanged, one list element per base.
    encoded.extend(seq_middle)
    for base, index in zip(seq_back, back_ix):
        encoded.append(index.format(base))
    return "*".join(encoded)
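Read through the code and make sure you understand it. Essentially, we just slice the original string and insert the bases into the format you need. Each element of the final output is appended to a list, and the list is joined with the * character at the end.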
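The three building blocks described above (slicing, `str.format`, and `str.join`) can be tried in isolation. This is only a small illustrative sketch; the variable names `front`, `middle`, `back`, `tagged`, and `joined` are introduced here for the demonstration and do not appear in the function.

```python
seq = "TGTACGTTGCTCCGAC"

# The three slices used by encode_sequence.
front = seq[:3]     # first three bases
middle = seq[3:-3]  # everything between the two ends
back = seq[-3:]     # last three bases
print(front, middle, back)  # TGT ACGTTGCTCC GAC

# str.format substitutes the base for the {} placeholder.
tagged = "/52MOEr{}/".format(front[0])
print(tagged)  # /52MOErT/

# "*".join puts the separator between elements only, not at the ends.
joined = "*".join([tagged, "A", "C"])
print(joined)  # /52MOErT/*A*C
```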
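If you need to specify dynamically how many bases are taken from the front and back of the sequence, and which labels they get, you can use this version. Note that the {} braces tell str.format where to insert the base.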
def encode_sequence_2(seq, front_ix, back_ix):
    # Compute the start of the back section explicitly; this avoids seq[-0:],
    # which would return the whole sequence when back_ix is empty.
    n_front = len(front_ix)
    back_start = len(seq) - len(back_ix)
    seq_front = seq[:n_front]
    seq_back = seq[back_start:]
    seq_middle = seq[n_front:back_start]
    encoded = []
    for base, index in zip(seq_front, front_ix):
        encoded.append(index.format(base))
    encoded.extend(seq_middle)
    for base, index in zip(seq_back, back_ix):
        encoded.append(index.format(base))
    return "*".join(encoded)
Here is the output:
> seq = "TGTACGTTGCTCCGAC"
> encode_sequence(seq)
/52MOErT/*/i2MOErG/*/i2MOErT/*A*C*G*T*T*G*C*T*C*C*/i2MOErG/*/i2MOErA/*/32MOErC/
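The dynamic version can be called in the same way. The sketch below is a minimal, self-contained illustration: the function is restated compactly so the snippet runs on its own, and the two-element template lists are an assumption chosen for the example, not something taken from the question.

```python
def encode_sequence_2(seq, front_ix, back_ix):
    # Compact restatement of the dynamic version so this snippet is runnable alone.
    n_front = len(front_ix)
    back_start = len(seq) - len(back_ix)
    encoded = [tpl.format(base) for base, tpl in zip(seq[:n_front], front_ix)]
    encoded.extend(seq[n_front:back_start])
    encoded.extend(tpl.format(base) for base, tpl in zip(seq[back_start:], back_ix))
    return "*".join(encoded)

# Hypothetical templates: only two modified bases at each end instead of three.
front_ix = ["/52MOEr{}/", "/i2MOEr{}/"]
back_ix = ["/i2MOEr{}/", "/32MOEr{}/"]

result = encode_sequence_2("TGTACGTTGCTCCGAC", front_ix, back_ix)
print(result)
# /52MOErT/*/i2MOErG/*T*A*C*G*T*T*G*C*T*C*C*G*/i2MOErA/*/32MOErC/
```

Because the number of modified bases is taken from the length of each template list, changing the lists is enough to change how many bases are wrapped at each end.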
If you have a list of sequences to encode, you can loop over the list and encode each one:
encoded_list = []
for seq in dna_list:
    encoded_list.append(encode_sequence(seq))
Or use a list comprehension:
encoded_list = [encode_sequence(seq) for seq in dna_list]