Python: perform actions only at certain locations in a text file

【Question title】: Python perform actions only at certain locations in text file
【Posted】: 2015-04-07 03:36:05
【Question description】:

I have a text file containing data like this:

AA 331             
line1 ...   
line2 ...    
% information here     
AA 332   
line1 ...    
line2 ...    
line3 ...   
%information here    
AA 1021   
line1 ...   
line2 ...  
% information here      
AA 1022    
line1 ...   
% information here     
AA 1023    
line1 ...    
line2 ...    
% information here

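I want to perform an action on the "information" only in the blocks that follow the smallest integer of each group, that is, after the "AA 331" and "AA 1021" lines, and not after the "AA 332", "AA 1022" and "AA 1023" lines.

P.S. This is only sample data from a large file.

In the code below I try to parse the text file and collect the integers that follow "AA" into the list "list1"; in the second function I group them to get the smallest value of each group into "list2". This returns integers like [331,1021,...]. So I want to extract the lines after "AA 331" and perform an action on them, but I don't know how to continue.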

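from itertools import groupby

def getlineindex(textfile):
    # Collect the integer that follows "AA" on every header line.
    list1 = []
    with open(textfile) as infile:
        for line in infile:
            if line.startswith("AA"):
                intid = int(line[3:])  # int() ignores surrounding whitespace
                list1.append(intid)
    return list1

def minimalinteger(list1):
    # Group consecutive IDs that share the same block of ten; keep the smallest of each.
    list2 = []
    for k, v in groupby(list1, key=lambda x: x // 10):
        minimalint = min(v)
        list2.append(minimalint)
    return list2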

list2 holds the smallest integers that follow "AA": [331,1021,..]

【Question discussion】:

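I think your question needs some clarification. What is the "smallest integer" after the lines you specify? Where does it occur, and is that position consistent/reliable? Also, how did you arrive at "AA 331" and "AA 1021" as the markers for the data you want to process? Is that input you expect to receive from a human, or is there a way to determine it computationally?
By smallest integer I mean 331
You will of course have noticed 331
OK, got it. They come in blocks of 10 integers. So they are randomly generated, but in intervals of 10. So 332 is a replica of 331, and 1022-1024 are replicas of 1021, so I want to keep blocks 331 and 1021. [By block I mean from the AA 331 line up to the % information line just before the AA 332 line]
@Danira, what if there is a random gap in the middle of a group of 10? So if you have 300,301,302,305,306,307, should we process both 300 and 305? (Sorry for pushing on edge cases here, but I think it is necessary to give you the help you need.)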

Tags: python regex parsing

【Solution 1】:

You can use something like this:

import re

matcher = re.compile(r"AA (\d+)")
already_was = set()  # groups of ten whose header has already been seen
good_block = False

with open(filename) as f:  # filename: path to your data file
    for line in f:
        m = matcher.match(line)
        if m:
            v = int(m.group(1)) // 10
            # Only the first header seen in each group of ten opens a good block.
            good_block = v not in already_was
            already_was.add(v)
        elif good_block:
            do_action()  # placeholder for whatever you do with this line

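This code only works if the first value in each group is the smallest one.

【Discussion】:

Yes, I edited my question after your answer. Thanks a lot for your help :-)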
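If the smallest ID is not guaranteed to come first in its group, one possible workaround is to read the data twice: first collect the smallest ID of every group of ten, then act only on the lines of blocks whose ID is in that set. The following is a minimal sketch and not part of the original answer; the names `smallest_ids` and `lines_in_wanted_blocks` and the in-memory `sample` are assumptions made for illustration, and groups are formed over the whole file rather than over consecutive runs.

```python
import re

AA_LINE = re.compile(r"AA (\d+)")

def smallest_ids(lines):
    """Return the set holding the smallest ID of every group of ten."""
    best = {}
    for line in lines:
        m = AA_LINE.match(line)
        if m:
            run_id = int(m.group(1))
            key = run_id // 10
            if key not in best or run_id < best[key]:
                best[key] = run_id
    return set(best.values())

def lines_in_wanted_blocks(lines):
    """Yield the data lines of blocks whose ID is the smallest of its group."""
    lines = list(lines)          # needed because the data is scanned twice
    keep = smallest_ids(lines)
    in_wanted = False
    for line in lines:
        m = AA_LINE.match(line)
        if m:
            in_wanted = int(m.group(1)) in keep
        elif in_wanted:
            yield line.rstrip("\n")

# Hypothetical sample in which 332 appears before 331.
sample = [
    "AA 332", "line1", "% info b",
    "AA 331", "line1", "line2", "% info a",
    "AA 1021", "line1", "% info c",
    "AA 1022", "line1", "% info d",
]
for data_line in lines_in_wanted_blocks(sample):
    print(data_line)
```

Holding all lines in memory keeps the sketch short; for a very large file it may be preferable to open the file twice instead.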

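【Solution 2】:

OK, here is my solution. At a high level, I go line by line looking for AA lines to know when I have found the start/end of a data block, and I look at what I call the run number to know whether we should process the next block. Then I have a subroutine that handles any given block, basically reading all the relevant lines and processing them if needed. That subroutine watches for the next AA line so that it knows when it is done.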

import re

runIdRegex = re.compile(r'AA (\d+)')

def processFile(fileHandle):
    lastNumber = None  # Last run number, necessary so we know if there's been a gap or if we're in a new block of ten.
    line = next(fileHandle, None)  # None if the file is empty
    while line is not None:  # None is being used as a special value indicating we've hit the end of the file.
        processData = False
        match = runIdRegex.match(line)
        if match:
            runNumber = int(match.group(1))
            if lastNumber is None:
                # Startup/first iteration
                processData = True
            elif runNumber - lastNumber == 1:
                # Continuation, see if the tens are the same.
                lastNumberTens = lastNumber // 10
                runNumberTens = runNumber // 10
                if lastNumberTens != runNumberTens:
                    processData = True
            else:
                processData = True

            # Always remember where we were.
            lastNumber = runNumber

            # And grab and process data.
            line = dataBlock(fileHandle, process=processData)
        else:
            try:
                line = next(fileHandle)
            except StopIteration:
                line = None

def dataBlock(fileHandle, process=False):
    runData = []
    try:
        line = next(fileHandle)
        match = runIdRegex.match(line)
        while not match:
            runData.append(line)
            line = next(fileHandle)
            match = runIdRegex.match(line)
    except StopIteration:
        # Hit end of file
        line = None

    if process:
        # Data processing call here
        # processData(runData)
        pass

    # Return line so we don't lose it!
    return line

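A few notes for you. First, I agree with Jimilian that you should use a regular expression to match the AA lines.

Second, the logic we talked about for deciding when to process the data lives in processFile. Specifically, these lines: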

        processData = False
        match = runIdRegex.match(line)
        if match:
            runNumber = int(match.group(1))
            if lastNumber is None:
                # Startup/first iteration
                processData = True
            elif runNumber - lastNumber == 1:
                # Continuation, see if the tens are the same.
                lastNumberTens = lastNumber // 10
                runNumberTens = runNumber // 10
                if lastNumberTens != runNumberTens:
                    processData = True
            else:
                processData = True

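I start by assuming we do not want to process the data, then identify when we do. Logically you could do it the other way round: assume you want to process the data, then identify when you do not. Next, we need to store the value of the last run so we know whether we need to process this run's data. (And note the edge case of the first run.) We know we want to process data when the sequence is broken (the difference between two runs is greater than 1), which is handled by the else statement. We also know we want to process data when the sequence moves into the next ten, which is handled by my integer division by 10.

Third, note the value returned from dataBlock. If you do not return it, you lose the AA line that caused dataBlock to stop iterating, and processFile needs that line to know whether the next data block should be processed.

Finally, I chose to advance the file handle with next and use exception handling to detect when the end of the file is reached. But don't think that is the only way. :)

If you have any questions, let me know in the comments.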
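The decision rule described above can be isolated into a small function to make it easier to check by hand. This is only a sketch of that rule; the name `should_process` and the sample list `ids` are assumptions for illustration and do not appear in the answer's code.

```python
def should_process(last_number, run_number):
    """Decide whether the block that starts at run_number should be processed."""
    if last_number is None:
        return True  # first block in the file
    if run_number - last_number != 1:
        return True  # the sequence is broken, so a new group starts here
    # Consecutive numbers: process only when we cross into the next ten.
    return last_number // 10 != run_number // 10

ids = [331, 332, 1021, 1022, 1023]
last = None
chosen = []
for run in ids:
    if should_process(last, run):
        chosen.append(run)
    last = run  # always remember where we were
print(chosen)  # [331, 1021]
```

Under this rule a gap inside a group of ten also starts a new block, so for 300, 301, 302, 305, 306, 307 both 300 and 305 would be processed.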
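As one example of another way, a plain for loop inside a generator can split the file into blocks without calling next or catching StopIteration. This is a hedged sketch rather than a drop-in replacement; `iter_blocks` is a hypothetical helper name, and `io.StringIO` stands in for a real file handle.

```python
import io
import re

RUN_ID = re.compile(r"AA (\d+)")

def iter_blocks(lines):
    """Yield (run_number, data_lines) for each 'AA <n>' block."""
    run_number, data = None, []
    for line in lines:
        match = RUN_ID.match(line)
        if match:
            if run_number is not None:
                yield run_number, data  # the previous block is complete
            run_number, data = int(match.group(1)), []
        elif run_number is not None:
            data.append(line.rstrip("\n"))
    if run_number is not None:
        yield run_number, data  # last block ends at end of file

# Stand-in for an open file; any iterable of lines works.
handle = io.StringIO("AA 331\nline1\n% info\nAA 332\nline1\n")
blocks = list(iter_blocks(handle))
print(blocks)  # [(331, ['line1', '% info']), (332, ['line1'])]
```

The caller can then decide per block whether to process it, for instance by comparing each run number with the previous one.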

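【Discussion】:

This explains my problem perfectly. Great solution. I will try it with my sample data and let you know. In any case, this is the one I want to accept. Thanks a lot for your time :-)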