Python: Using re module to find string, then print values under string

【Question title】: Python: Using re module to find string, then print values under string
【Posted】: 2015-06-14 19:54:45
【Question】:

I am trying to use the re module to search a fairly large file for a string. The file I am searching is formatted like this:

      220
      BOX 1,  STEP 1
      C        15.1760586379       13.7666285127        4.1579861659
      F        13.7752750995       13.3845518556        4.1992254467
      F        15.1122807811       15.0753387163        3.8457966464
      H        15.5298304628       13.5873563855        5.1615910859
      H        15.6594416869       13.1246597008        3.3754112615
        5
     BOX 2,  STEP 1
     C        15.1760586379       13.7666285127        4.1579861659
     F        13.7752750995       13.3845518556        4.1992254467
     F        15.1122807811       15.0753387163        3.8457966464
     H        15.5298304628       13.5873563855        5.1615910859
     H        15.6594416869       13.1246597008        3.3754112615
       240
     BOX 1,  STEP 2
     C        12.6851133069        2.8636250164        1.1788963097
     F        11.7935769268        1.7912366066        1.3042188034
     F        13.7887138736        2.3739304018        0.4126088380
     H        12.1153838312        3.7024696077        0.7164304431
     H        13.0962656950        3.1549047758        2.1436863477
     C        12.6745394723        3.6338848332       15.1374252921
     F        11.8703828307        4.3473226569       16.0480492173
     F        12.2304604843        2.3709059503       14.9433964493
     H        12.6002811971        4.1968554204       14.1449118786
     H        13.7469256153        3.6086212350       15.5204655285

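This format continues for both BOX 1 and BOX 2, for about 30000 steps in total per BOX. I have code that uses the re module to search this file for the keyword "STEP". Unfortunately, it produces no results when I run it. I need my code to 1) search ONLY BOX 1, then 2) print/write to a file all coordinates that follow STEP 1 (preferably omitting the "C", "F" and "H" labels, so only the coordinates), then 3) increase the "STEP" number by 48 and repeat 2). I would also like to ignore the lone numbers such as "5" and "240" in the file, so the code should make sure they do not end up in the output. This is what I have so far (it doesn't work):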

 import re
 shakes = open("mc_coordinates", "r")
 i = 1
 for line in shakes:
        if re.match("(.*)STEP i(.*)", line):
               print line
        i+=48

Here is an example of what I would like my code to produce:

  STEP 1
    15.1760586379       13.7666285127        4.1579861659
    13.7752750995       13.3845518556        4.1992254467
    15.1122807811       15.0753387163        3.8457966464
    15.5298304628       13.5873563855        5.1615910859
    15.6594416869       13.1246597008        3.3754112615  
  STEP 49
    12.6851133069        2.8636250164        1.1788963097
    11.7935769268        1.7912366066        1.3042188034
    13.7887138736        2.3739304018        0.4126088380
    12.1153838312        3.7024696077        0.7164304431
    13.0962656950        3.1549047758        2.1436863477
    12.6745394723        3.6338848332       15.1374252921
    11.8703828307        4.3473226569       16.0480492173
    12.2304604843        2.3709059503       14.9433964493
    12.6002811971        4.1968554204       14.1449118786
    13.7469256153        3.6086212350       15.5204655285
  STEP 97
    15.1760586379       13.7666285127        4.1579861659
    13.7752750995       13.3845518556        4.1992254467
    15.1122807811       15.0753387163        3.8457966464
    15.5298304628       13.5873563855        5.1615910859
    15.6594416869       13.1246597008        3.3754112615

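Note that this is a condensed version; there are usually about 250 lines of coordinates between "STEP" numbers. Any ideas or thoughts would be greatly appreciated. Thanks!!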

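【Question comments】:

Just split the lines into sections wherever STEP occurs, then run re on those sections separately. No?
Only in increments of 48 steps. So, for example, I only want BOX 1 STEP 1, BOX 1 STEP 49, BOX 1 STEP 97, and so on until the end of the file.
Can't you extract the step number and skip the ones you don't want?
This is a perfect use case for itertools.groupby. Write a generator that yields each coordinate group, then filter for the groups you want. No regex is needed here either.
The code I gave already works this way.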
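
The itertools.groupby suggestion in the comments could look roughly like the sketch below. It is only an illustration under stated assumptions: the names `HEADER_RE`, `COORD_RE`, `iter_blocks` and `select_steps` are hypothetical, a regex is still used to recognise the two line types, and coordinates are assumed to be unsigned or signed decimals as in the sample.

```python
import re
from itertools import groupby

# Hypothetical helper patterns, written for the sample format in the question.
HEADER_RE = re.compile(r"\s*BOX\s+(\d+),\s+STEP\s+(\d+)")
COORD_RE = re.compile(
    r"\s*[A-Za-z]+\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)\s*$"
)


def iter_blocks(lines):
    """Yield (box, step, coords) for every header followed by coordinate lines."""
    header = None
    is_coord = lambda line: COORD_RE.match(line) is not None
    for coord_run, group in groupby(lines, key=is_coord):
        if not coord_run:
            # A run of non-coordinate lines: count lines and the BOX/STEP header.
            for line in group:
                m = HEADER_RE.match(line)
                if m:
                    header = (int(m.group(1)), int(m.group(2)))
        elif header is not None:
            # A run of coordinate lines: keep only the three numbers.
            coords = [COORD_RE.match(line).groups() for line in group]
            yield header[0], header[1], coords


def select_steps(lines, box=1, stride=48):
    """Yield (step, coords) for the given box at steps 1, 1 + stride, ..."""
    for b, step, coords in iter_blocks(lines):
        if b == box and step % stride == 1:
            yield step, coords
```

Because both functions are generators, passing an open file object processes the file one block at a time instead of loading it all into memory.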

Tags: python regex string

【Solution 1】:

A quick, though possibly not very efficient, approach is to parse the file line by line and keep a little state.

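# not run against the real file, but it shows the idea
import re

header_re = re.compile(r"\s*BOX\s+(\d+),\s+STEP\s+(\d+)")
i = 1           # next STEP number to output for BOX 1
output = False  # are we in a block that should be output?
with open("mc_coordinates", "r") as shakes:
    for line in shakes:
        header = header_re.match(line)
        if header:
            box, step = int(header.group(1)), int(header.group(2))
            # only BOX 1 at the wanted step switches output on; any other header switches it off
            output = (box == 1 and step == i)
            if output:
                print("STEP %d" % step)
                i += 48
        elif output and len(line.split()) == 4:
            # drop the leading C, F or H; lone count lines such as "5" are skipped
            print(" ".join(line.split()[1:]))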
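
One reason the pattern "(.*)STEP i(.*)" from the question never matches is that the i inside the string literal is an ordinary letter, not the loop variable. The question's loop also runs i+=48 on every line rather than only after a match. The short sketch below illustrates the first point; the sample line and the variable names are made up for the illustration.

```python
import re

line = "      BOX 1,  STEP 49"
i = 49

# The letter i in the pattern is matched literally; the variable i is never consulted.
literal = re.match(r"(.*)STEP i(.*)", line)

# Build the pattern from the variable instead.
# \b keeps "STEP 49" from also matching "STEP 490".
pattern = re.compile(r".*STEP\s+%d\b" % i)
interpolated = pattern.match(line)

print(literal)                   # None
print(interpolated is not None)  # True
```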

【Comments】:

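【Solution 2】:

The simplest approach seems to be to use two regex patterns: 1. Find the "BOX 1, STEP 48N+1" string. 2. Capture the coordinates.

I've provided some code below. I haven't tried it on your data, but any bugs should be easy to fix. Basically, what you need is a small state machine that tells you when you should and shouldn't print the coordinates.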

import re

# re.match anchors at the start of the line, so allow for the leading indentation
step_re = re.compile(r'\s*BOX 1,\s+STEP (\d+)')
# three whitespace-separated decimals, one capture group each
coord_re = re.compile(r'\s+(\d+\.\d+)' * 3)
in_step = False
with open('your_file.txt') as f:
  for line in f:
    if in_step:
      coord_match = coord_re.search(line)
      if coord_match:
        print(' '.join(coord_match.groups()))
        continue
      in_step = False  # a non-coordinate line ends the block

    step_match = step_re.match(line)
    if step_match and (int(step_match.group(1)) % 48) == 1:
      print('STEP {}'.format(step_match.group(1)))
      in_step = True
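
The two kinds of pattern used in this answer can be checked against lines copied from the question before running them on the full file. The sketch below is only such a check, with hypothetical variable names; it also shows that re.match anchors at the start of the string, so a header pattern has to allow for the indentation in the file.

```python
import re

header = "      BOX 1,  STEP 49"
coords = "      C        15.1760586379       13.7666285127        4.1579861659"

# re.match only matches at the start of the string.
anchored = re.compile(r"BOX 1,\s+STEP (\d+)")
tolerant = re.compile(r"\s*BOX 1,\s+STEP (\d+)")
print(anchored.match(header))           # None: the line starts with spaces
print(tolerant.match(header).group(1))  # 49

# Repeating the pattern string three times gives three capture groups.
coord_re = re.compile(r"\s+(\d+\.\d+)" * 3)
print(coord_re.search(coords).groups())
# ('15.1760586379', '13.7666285127', '4.1579861659')
print(coord_re.search("        5"))     # None: a count line is not a coordinate line
```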

【Comments】: