【Question title】: For-loop to count differences in lines with Python
【Posted】: 2026-01-28 15:40:01
【Question】:

I have a file full of lines like these (this is just a small part of the file):

9 Hyphomicrobium facile Hyphomicrobiaceae
9 Hyphomicrobium facile Hyphomicrobiaceae
7 Mycobacterium kansasii Mycobacteriaceae
7 Mycobacterium gastri Mycobacteriaceae
10 Streptomyces olivaceiscleroticus Streptomycetaceae
10 Streptomyces niger Streptomycetaceae
1 Streptomyces geysiriensis Streptomycetaceae
1 Streptomyces minutiscleroticus Streptomycetaceae
0 Brucella neotomae Brucellaceae
0 Brucella melitensis Brucellaceae
2 Mycobacterium phocaicum Mycobacteriaceae

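The number refers to a cluster, and it is followed by the genus, the species, and the family. What I want to do is write a program that looks at each line and reports: a list of the different genera in each cluster, and how many of each genus are in the cluster. So I'm interested in the cluster number and the first "word" (the genus) on each line.

My trouble is that I'm not sure how to get at this information. I think I need a for-loop, starting with the lines that begin with '0'. The output would be a file that looks something like this:

Cluster 0: Brucella(2) # list the cluster, then the genera in the cluster with their counts, something like this
Cluster 1: Streptomyces(2)
Cluster 2: Brucella(1)
etc.

Eventually I want to do the same count for the families in each cluster, and then for genera and species together. Any ideas on how to get started would be greatly appreciated!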

【Comments】:

    Tags: python for-loop iteration


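    【Solution 1】:

    I thought this would be a fun little toy project, so I wrote a small hack that reads an input file like yours from standard input, recursively counts and formats the output, and prints something a bit like your output, but in a nested format, like this: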

    Cluster 0:
        Brucella(2)
            melitensis(1)
                Brucellaceae(1)
            neotomae(1)
                Brucellaceae(1)
        Streptomyces(1)
            neotomae(1)
                Brucellaceae(1)
    Cluster 1:
        Streptomyces(2)
            geysiriensis(1)
                Streptomycetaceae(1)
            minutiscleroticus(1)
                Streptomycetaceae(1)
    Cluster 2:
        Mycobacterium(1)
            phocaicum(1)
                Mycobacteriaceae(1)
    Cluster 7:
        Mycobacterium(2)
            gastri(1)
                Mycobacteriaceae(1)
            kansasii(1)
                Mycobacteriaceae(1)
    Cluster 9:
        Hyphomicrobium(2)
            facile(2)
                Hyphomicrobiaceae(2)
    Cluster 10:
        Streptomyces(2)
            niger(1)
                Streptomycetaceae(1)
            olivaceiscleroticus(1)
                Streptomycetaceae(1)
    

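    I also added some junk data to my table so that I could see an extra entry in cluster 0, separate from the other two. The idea is that you should be able to see each "Cluster" entry followed by nested, indented entries for genus, species, and family. I hope it wouldn't be hard to extend to deeper trees either.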

    # Sys for stdio stuff
    import sys
    # re for the re.split -- this can go if you find another way to parse your data
    import re
    
    
    # A global (shame on me) for storing the data we're going to parse from stdin
    data = []
    
    # read lines from standard input until it's empty (end-of-file)
    for line in sys.stdin:
        # Split lines on whitespace (gobbling multiple spaces for robustness)
        # and trim whitespace off the beginning and end of input (strip).
        # The raw string keeps "\s" from being read as a string escape.
        entry = re.split(r"\s+", line.strip())
    
        # Skip blank or short lines; the sort below needs all four fields
        if len(entry) < 4:
            continue
    
        # Throw the array into my global data array, it'll look like this:
        # [ "0", "Brucella", "melitensis", "Brucellaceae" ]
        # A lot of this code assumes that the first field is an integer, what
        # you call "cluster" in your problem description
        data.append(entry)
    
    # Sort, first key is expected to be an integer, and we want a numerical
    # sort rather than a string sort, so convert to int, then sort by
    # each subsequent column. The lambda is a function that returns a tuple
    # of keys we care about for each item
    data.sort(key=lambda item: (int(item[0]), item[1], item[2], item[3]))
    
    
    # Our recursive function -- we're basically going to treat "data" as a tree,
    # even though it's not.
    # parameters:
    #    start - an integer telling us what line to begin working from so we needn't
    #            walk the whole tree each time to figure out where we are.
    #    super - An array that captures where we are in the search. This array
    #            will have more elements in it as we deepen our traversal of the "tree"
    #            Initially, it will be []
    #            In the next ply of the tree, it will be [ '0' ]
    #            Then something like [ '0', 'Brucella' ] and so on.
    #    data -  The global data structure -- this never mutates after the sort above,
    #            I could have just used the global directly
    def groupedReport(start, super, data):
        # Figure out what ply we're on in our depth-first traversal of the tree
        depth =  len(super)
        # Count entries in the super class, starting from "start" index in the array:
        count = 0
    
        # For the few records in the data file that match our "super" exactly, we count
        # occurrences.
        if depth != 0:
            for i in range(start, len(data)):
                if (data[i][0:depth] == data[start][0:depth]):
                    count = count + 1
                else:
                    break  # We can stop counting as soon as a match fails,
                           # because of the way our input data is sorted
        else:
            count = len(data)
    
    
        # At depth == 1, we're reporting about clusters -- this is the only piece of
        # the algorithm that's not truly abstract, and it's only for presentation
        if (depth == 1):
            sys.stdout.write("Cluster " + super[0] + ":\n")
        elif (depth > 0):
            # Every other depth: indent with 4 spaces for every ply of depth, then
            # output the unique field we just counted, and its count
            sys.stdout.write((' ' * ((depth - 1) * 4)) +
                             data[start][depth - 1] + '(' + str(count) + ')\n')
    
        # Recursion: we're going to figure out a new depth and a new "super"
        # and then call ourselves again. We break out on depth == 4 because
        # of one other assumption (I lied before about the abstract thing) I'm
        # making about our input data here. This could
        # be made more robust/flexible without a lot of work
        subsuper = None
        substart = start
        for i in range(start, start + count):
            record = data[i]  # The original record from our data
            newdepth = depth + 1
            if newdepth > 4:
                break
    
            # array splice creates a new copy
            newsuper = record[0:newdepth]
            if newsuper != subsuper:
                # Recursion here!
                groupedReport(substart, newsuper, data)
                # Track our new "subsuper" for subsequent comparisons
                # as we loop through matches
                subsuper = newsuper
    
            # Track position in the data for next recursion, so we can start on
            # the right line
            substart = substart + 1
    
    # First call to groupedReport starts the recursion
    groupedReport(0, [], data)
    

    If you save my Python code in a file called something like "classifier.py", you can run your input.txt file (or whatever you called it) through it like this:

    cat input.txt | python classifier.py
    

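    Most of the magic of the recursion, if you will, is done with array slices. It relies heavily on being able to compare array slices, and on the fact that my sort routine can put the input data in a meaningful order. If inconsistent capitalization could cause mismatches, you may want to convert the input data to all lowercase.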
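
The two ideas the recursion leans on, comparing list slices and sorting with a tuple key, can be seen in isolation. This is a minimal sketch with a few sample rows in the same four-column layout; the names `rows`, `same_genus` and `same_species` are illustrative and do not appear in the answer's code.

```python
# Rows in the same [cluster, genus, species, family] layout (sample values only)
rows = [
    ["10", "Streptomyces", "niger", "Streptomycetaceae"],
    ["0", "Brucella", "neotomae", "Brucellaceae"],
    ["0", "Brucella", "melitensis", "Brucellaceae"],
]

# Numeric sort on the cluster, then alphabetical on the remaining columns.
# Without int(), "10" would sort before "2" as a string.
rows.sort(key=lambda r: (int(r[0]), r[1], r[2], r[3]))

# Slices are new lists, and lists compare element by element
same_genus = rows[0][0:2] == rows[1][0:2]      # cluster and genus both match
same_species = rows[0][0:3] == rows[1][0:3]    # species differs

print([r[0] for r in rows])
print(same_genus, same_species)
```

Because the rows are sorted, every group of equal prefixes sits in one contiguous run, which is why the counting loop can stop at the first mismatch.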

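    【Comments】:

    • A quick question: when I run it, this shows up at the top: Brucellaceae(28). (Why does it put Brucellaceae(28) at the top?)
    • I may have a bug; if you want to give me your real input, I'll sort it out.
    • The real input is a large file I have; can I share it? Would you be willing to briefly explain how this code works? Only because it's ideal for what I'm doing, and I want to make sure I understand what's going on in it so I can use it in different ways in the future! Thanks again! :D
    • I'm asking so that I can understand how to change the code so it looks only at genus or only at family.
    • OK, @Jen, I've added a lot of comments; feel free to ask questions. My guess is that the first line of your input file doesn't start with the "cluster" integer, so it conflicts with my assumptions, as described in the new comments. Let me know.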
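    【Solution 2】:

    This is easy to do.

    1. Create an empty dictionary {} to store your results; let's call it "result".
    2. Loop over the data line by line.
    3. Split each line on spaces, following your structure, to get 4 elements: cluster, genus, species, family.
    4. Increment the genus count under each cluster key as you find it in the current loop; the first time a genus appears, its count must be set to 1.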

    result = { '0': { 'Brucella': 2} ,'1':{'Streptomyces':2}..... }

    Code:

    my_data = """9 Hyphomicrobium facile Hyphomicrobiaceae
    9 Hyphomicrobium facile Hyphomicrobiaceae
    7 Mycobacterium kansasii Mycobacteriaceae
    7 Mycobacterium gastri Mycobacteriaceae
    10 Streptomyces olivaceiscleroticus Streptomycetaceae
    10 Streptomyces niger Streptomycetaceae
    1 Streptomyces geysiriensis Streptomycetaceae
    1 Streptomyces minutiscleroticus Streptomycetaceae
    0 Brucella neotomae Brucellaceae
    0 Brucella melitensis Brucellaceae
    2 Mycobacterium phocaicum Mycobacteriaceae"""
    
    result = {}
    for line in my_data.splitlines():
        # split() with no argument ignores repeated and trailing whitespace,
        # so stray spaces at the end of a line cannot break the unpacking
        cluster, genus, species, family = line.split()
        # Create the cluster and genus entries on first sight, then count
        result.setdefault(cluster, {}).setdefault(genus, 0)
        result[cluster][genus] += 1
    
    print(result)
    
    
    {'10': {'Streptomyces': 2}, '1': {'Streptomyces': 2}, '0': {'Brucella': 2}, '2': {'Mycobacterium': 1}, '7': {'Mycobacterium': 2}, '9': {'Hyphomicrobium': 2}}
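
To get from this dictionary to the "Cluster 0: Brucella(2)" lines asked for in the question, and to count families instead of genera, the same idea can be wrapped in two small functions. This is a rough sketch, not part of the original answer: `count_by_column`, `format_report` and the `sample` list are hypothetical names, and the column index (1 = genus, 2 = species, 3 = family) assumes the four-field layout described in the question.

```python
from collections import Counter, defaultdict

def count_by_column(lines, column):
    """Count values of one column (1=genus, 2=species, 3=family) per cluster."""
    counts = defaultdict(Counter)
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        # Store the cluster as an int so clusters sort numerically later
        counts[int(fields[0])][fields[column]] += 1
    return counts

def format_report(counts):
    """Return lines such as 'Cluster 0: Brucella(2)', clusters in numeric order."""
    report = []
    for cluster in sorted(counts):
        parts = ", ".join(
            f"{name}({n})" for name, n in sorted(counts[cluster].items())
        )
        report.append(f"Cluster {cluster}: {parts}")
    return report

# Sample rows taken from the question's data
sample = [
    "0 Brucella neotomae Brucellaceae",
    "0 Brucella melitensis Brucellaceae",
    "1 Streptomyces geysiriensis Streptomycetaceae",
    "1 Streptomyces minutiscleroticus Streptomycetaceae",
    "2 Mycobacterium phocaicum Mycobacteriaceae",
]

genus_report = format_report(count_by_column(sample, 1))
family_report = format_report(count_by_column(sample, 3))
print("\n".join(genus_report))
```

Passing a different column index is the only change needed to switch between genus and family counts.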
    

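    【Comments】:

    • Thanks for your answer! Would splitting each line into four elements break the dictionary? (I may be asking a silly question.)
    • Thanks! I keep getting this error, though: Traceback (most recent call last): File "sort.py", line 7, in for line in result.split("\n"): AttributeError: 'dict' object has no attribute 'split'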
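
The AttributeError in the last comment arises because `split` was called on the `result` dictionary rather than on the text; `split` is a string method, so the loop has to run over the input text or over the lines of a file. A minimal sketch that reads from a file instead of a hard-coded string might look like the following. The function name `count_genera` and the temporary demo file are assumptions made only so the example runs on its own; in practice you would pass the path of your real input file.

```python
import os
import tempfile

def count_genera(path):
    """Count genera per cluster, reading one record per line from a file."""
    result = {}
    with open(path) as handle:
        for line in handle:          # iterate over the file, not over the dict
            fields = line.split()
            if len(fields) != 4:
                continue             # skip blank or malformed lines
            cluster, genus, species, family = fields
            result.setdefault(cluster, {}).setdefault(genus, 0)
            result[cluster][genus] += 1
    return result

# Demo with a throwaway file so the sketch is self-contained
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("0 Brucella neotomae Brucellaceae\n")
    tmp.write("0 Brucella melitensis Brucellaceae\n")
    tmp.write("\n")
    demo_path = tmp.name

demo_result = count_genera(demo_path)
os.remove(demo_path)
print(demo_result)
```

On the earlier question in the comments: splitting a line only produces four local values; the dictionary is touched solely by the `setdefault` and `+= 1` lines, so the split itself cannot damage it.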