【问题标题】:Using chaos game representation for gene sequences [Python program]对基因序列使用混沌游戏表示[Python程序]
【发布时间】:2017-11-13 18:38:02
【问题描述】:

我需要使用混沌游戏表示来表示许多基因序列 我从 Boštjan Cigan 的博客 (https://bostjan-cigan.com/chaos-game-representation-of-gene-structure-in-python/) 获得了这个 python 代码

作者:Bostjan Cigan

https://bostjan-cigan.com

    import collections
    from collections import OrderedDict
    from matplotlib import pyplot as plt
    from matplotlib import cm

    import pylab
    import math

    f = open("ensemblSeq.fa")
    s1 = f.read()
    data = "".join(s1.split("\n")[1:])

    def count_kmers(sequence, k):
        d = collections.defaultdict(int)
        for i in xrange(len(data)-(k-1)):
            d[sequence[i:i+k]] +=1
        for key in d.keys():
             if "N" in key:
                 del d[key]
        return d

    def probabilities(kmer_count, k):
        probabilities = collections.defaultdict(float)
        N = len(data)
        for key, value in kmer_count.items():
            probabilities[key] = float(value) / (N - k + 1)
        return probabilities

    def chaos_game_representation(probabilities, k):
        array_size = int(math.sqrt(4**k))
        chaos = []
        for i in range(array_size):
            chaos.append([0]*array_size)

        maxx = array_size
        maxy = array_size
        posx = 1
        posy = 1
         for key, value in probabilities.items():
             for char in key:
                 if char == "T":
                    posx += maxx / 2
                 elif char == "C":
                    posy += maxy / 2
                 elif char == "G":
                     posx += maxx / 2
                     posy += maxy / 2
                 maxx = maxx / 2
                 maxy /= 2
             chaos[posy-1][posx-1] = value
             maxx = array_size
             maxy = array_size
             posx = 1
             posy = 1

         return chaos

      f3 = count_kmers(data, 3)
      f4 = count_kmers(data, 4)

      f3_prob = probabilities(f3, 3)
      f4_prob = probabilities(f4, 4)

      chaos_k3 = chaos_game_representation(f3_prob, 3)
      pylab.title('Chaos game representation for 3-mers')
      pylab.imshow(chaos_k3, interpolation='nearest', cmap=cm.gray_r)
      pylab.show()

      chaos_k4 = chaos_game_representation(f4_prob, 4)
      pylab.title('Chaos game representation for 4-mers')
      pylab.imshow(chaos_k4, interpolation='nearest', cmap=cm.gray_r)
      pylab.show()

此代码工作正常,但我有很多序列文件我需要遍历文件夹中的每个 fasta 文件,并获取存储在文件夹中的单个图,其中图像文件的名称对应于 fasta 文件的名称根据我的需要修改代码

我是 python 和 StackOverflow 的新手,如果有任何错误请忽略

提前致谢

【问题讨论】:

  • 请修正您帖子的缩进。这里的代码有点混乱。
  • 我已经修复了缩进是好的还是需要做更多的事情@Franz

标签: python dna-sequence representation chaos


【解决方案1】:

因此,如果您想将代码应用于目录中的每个文件,一个非常简单的方法是调用for-loop 中的所有文件。我建议如下:

import collections
import os
from collections import OrderedDict
from matplotlib import pyplot as plt
from matplotlib import cm

import pylab
import math

def count_kmers(sequence, k):
    d = collections.defaultdict(int)
    for i in xrange(len(data)-(k-1)):
        d[sequence[i:i+k]] +=1
    for key in d.keys():
         if "N" in key:
             del d[key]
    return d

def probabilities(kmer_count, k):
    probabilities = collections.defaultdict(float)
    N = len(data)
    for key, value in kmer_count.items():
        probabilities[key] = float(value) / (N - k + 1)
    return probabilities

def chaos_game_representation(probabilities, k):
    array_size = int(math.sqrt(4**k))
    chaos = []
    for i in range(array_size):
        chaos.append([0]*array_size)

    maxx = array_size
    maxy = array_size
    posx = 1
    posy = 1
    for key, value in probabilities.items():
        for char in key:
            if char == "T":
                posx += maxx / 2
            elif char == "C":
                posy += maxy / 2
            elif char == "G":
                 posx += maxx / 2
                 posy += maxy / 2
            maxx = maxx / 2
            maxy /= 2
        chaos[posy-1][posx-1] = value
        maxx = array_size
        maxy = array_size
        posx = 1
        posy = 1
    return chaos

if __name__ == "__main__":
    PATH = os.getcwd()
    filelist = sorted([os.path.join(PATH, f) for f in os.listdir(PATH) if f.endswith('.fa')])
    for file in filelist:
        f = open(file)
        s1 = f.read()
        data = "".join(s1.split("\n")[1:])
        f3 = count_kmers(data, 3)
        f4 = count_kmers(data, 4)

        f3_prob = probabilities(f3, 3)
        f4_prob = probabilities(f4, 4)

        chaos_k3 = chaos_game_representation(f3_prob, 3)
        pylab.title('Chaos game representation for 3-mers')
        pylab.imshow(chaos_k3, interpolation='nearest', cmap=cm.gray_r)
        pylab.savefig(os.path.splitext(file)[0]+'chaos3.png')
        pylab.show()

        chaos_k4 = chaos_game_representation(f4_prob, 4)
        pylab.title('Chaos game representation for 4-mers')
        pylab.imshow(chaos_k4, interpolation='nearest', cmap=cm.gray_r)
        pylab.savefig(os.path.splitext(file)[0]+'chaos4.png')
        pylab.show()

我只是绕了一个循环并添加了一个pylab.savefig() 调用。此外,我使用os 从您的目录中获取文件名。它现在应该可以工作了。

【讨论】:

  • 我在 maxx = maxx /2 附近收到一个未缩进的错误
  • 缩进有点乱。我修好了它。顺便说一句,我没有改变你的功能。所以你也可以使用原来的缩进。
  • 如果 name == "main",你能解释一下这实际上是做什么的吗?
  • 这只是一个检查,你是否直接执行你的脚本。如果您只是将your_file.pyimport your_file 导入另一个脚本,它只是导入类和函数,而不是main 内部的代码。它使重用代码变得更加简单。我建议你阅读stackoverflow.com/questions/419163/what-does-if-name-main-do
  • ' chaos_k3 = chaos_game_representation(f3_prob, 3) matrix1 = np.array(chaos_k3).reshape(8,8) print matrix1' 我在上面的程序中使用了这段代码来得到数组的输出但是我得到的矩阵为 [[ 0.03061224 0.01391466 0.01113173 0.02319109 0.01669759 0.01020408 0.01669759 0.02690167] (由于空间问题,我得到了 8 条类似的行粘贴一个样本
猜你喜欢
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
  • 2020-02-23
  • 1970-01-01
  • 1970-01-01
  • 2015-06-14
  • 1970-01-01
  • 1970-01-01
相关资源
最近更新 更多