Extracting specific text between strings: answers

【Question title】: Extracting specific text between strings
【Posted】: 2020-08-09 04:40:54
【Question description】:

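I am trying to extract specific lines from a .txt file, corresponding to 7 specific devices (0-6), and then operate on that data.

Here is an example:

From a very large file, I extract one event (here 169139), which contains information from 6 of the 7 devices (here only 1, 2, 3, 4, 5, 6, since device 0 has no data). For any such event, I do not know in advance how many devices will fire and produce output. It can be all of them, none, or only some.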

=== 169139 ===
Start: 4.80374e+19
End:   4.80374e+19
--- 1 ---
Pix 9, 66
--- 2 ---
Pix 11, 31
Pix 12, 31
--- 3 ---
Pix 17, 53
Pix 16, 53
Pix 16, 54
--- 4 ---
Pix 44, 64
--- 5 ---
Pix 49, 133
Pix 48, 133
--- 6 ---
Pix 109, 143
Pix 108, 143
Pix 108, 144 
Pix 109, 144

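Events are easy to iterate over: I can select everything shown above, up to the next event (here, the next line of the .txt is === 169140 ===).

I can extract the information for a specific device with the following code: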

def start_stop_plane (list, dev):
    start_reading = [i for i in range(len(list)) if list[i] == "--- " + str(dev) + " ---"][0]
    stop_reading = [i for i in range(len(list)) if list[i] == "--- " + str(int(dev)+1) + " ---"][0]
    return list[start_reading:stop_reading]

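Here, list is the first code block above (the complete event). It is a list generated in a way similar to the code above, with the --- strings swapped for === strings (the markers between events).

My problem: this works for everything from 0 to 5. For 6 it crashes, because there is no int(dev)+1 marker. I tried adding an or in stop_reading to recognize the occurrence of ===, but it did not work.

In this situation, how can I signal the end of the list and make sure no device is lost?

【Question comments】:

The variable dev does not seem to be defined in your code. I think you want to replace plane with dev.
@LydiavanDyke The function actually takes list and dev as inputs :) That plane was a leftover from some old code.

Tags: python list for-loop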

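【Solution 1】:

You should build your "--- plane ---" marker up front and let Python find it for you with basic tools such as in and .index.

To get the subset of data lines up to the next marker, you can use takewhile from itertools: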

data="""=== 169139 ===
Start: 4.80374e+19
End:   4.80374e+19
--- 1 ---
Pix 9, 66
--- 2 ---
Pix 11, 31
Pix 12, 31
--- 3 ---
Pix 17, 53
Pix 16, 53
Pix 16, 54
--- 4 ---
Pix 44, 64
--- 5 ---
Pix 49, 133
Pix 48, 133
--- 6 ---
Pix 109, 143
Pix 108, 143
Pix 108, 144 
Pix 109, 144""".split("\n")

from itertools import takewhile
def planeData(data,plane):
    marker = f"--- {plane} ---"
    if marker not in data: return []
    start = data.index(marker)+1
    return list(takewhile(lambda d:not d.startswith("---"),data[start:]))

Output:

for line in planeData(data,0): print(line)
# nothing printed

for line in planeData(data,5): print(line)
# Pix 49, 133
# Pix 48, 133

for line in planeData(data,6): print(line)
# Pix 109, 143
# Pix 108, 143
# Pix 108, 144 
# Pix 109, 144
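
If every device of an event is needed, calling the function once per device rescans the list each time. A minimal sketch of a single-pass alternative follows. The name parse_event and the returned dict layout are my own assumptions, not part of the answer, and the sketch assumes the list holds exactly one event.

```python
def parse_event(lines):
    """Group the 'Pix' lines of ONE event by device number.

    Hypothetical helper: assumes markers look exactly like '--- N ---'.
    """
    devices = {}
    current = None  # no device marker seen yet
    for line in lines:
        line = line.strip()
        if line.startswith("--- ") and line.endswith(" ---"):
            current = int(line[4:-4])   # text between the dashes
            devices[current] = []
        elif line.startswith("==="):
            current = None              # event header, not device data
        elif current is not None and line:
            devices[current].append(line)
    return devices


event = """=== 169139 ===
Start: 4.80374e+19
End:   4.80374e+19
--- 1 ---
Pix 9, 66
--- 6 ---
Pix 108, 144
Pix 109, 144""".split("\n")

devices = parse_event(event)
print(devices.get(0, []))  # []  (device 0 did not fire)
print(devices[6])          # ['Pix 108, 144', 'Pix 109, 144']
```

Using devices.get(dev, []) gives an empty list for a device that did not fire, so no device is lost and none raises an error.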

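【Comments】:

Hi Alain! Thanks for replying to this question! I could not get the code to work, because it complains that list is not callable in the function's return statement. Using a set instead of a list works, but it brings more complications. How did you get it to work?
Later edit: the "list is not callable" problem seems to be related only to Jupyter Notebooks. Running python from the shell has no such problem.
It may be due to a different version of Python itself. The format string f"--- {plane} ---" appeared in Python 3 (3.6 and later); Python 2.x does not support it.
You are right. It was a version issue. There is one thing I did not account for. Suppose another event === abc === is contained in the file, right after the last line of the data above. Example: data="""=== 169139 ===\n--- 1 ---\nPix 9, 66\n--- 6 ---\nPix 108, 144\nPix 108, 144\nPix 109, 144\n=== 169140 ===\nStart: 4.80374e+19\nEnd: 4.80374e+19""".split("\n") When === is encountered, takewhile does not stop; it only stops when another --- appears. In this case, how can I signal the end of the event?
You can replace not d.startswith("---") with not d.replace("=","-").startswith("---") or with d[:3] not in ["---","==="]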
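
The suggestion in the last comment can also be written by passing a tuple of prefixes to str.startswith. A minimal sketch, using a shortened version of the commenter's example data; the names plane_data and MARKERS are mine:

```python
from itertools import takewhile

MARKERS = ("---", "===")  # device-marker and event-marker prefixes


def plane_data(data, plane):
    """Return the lines that follow '--- plane ---' up to the next marker."""
    marker = f"--- {plane} ---"
    if marker not in data:
        return []
    start = data.index(marker) + 1
    # str.startswith accepts a tuple and is True if any prefix matches
    return list(takewhile(lambda d: not d.startswith(MARKERS), data[start:]))


data = """=== 169139 ===
--- 1 ---
Pix 9, 66
--- 6 ---
Pix 108, 144
Pix 109, 144
=== 169140 ===
Start: 4.80374e+19
End:   4.80374e+19""".split("\n")

print(plane_data(data, 6))  # ['Pix 108, 144', 'Pix 109, 144']
```

Note that data.index finds the first occurrence of the marker, so if the list holds several events this only ever returns the device block from the first event that contains it.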

【Solution 2】:

You can use the string index method.

Code

def start_stop_dev(lst, dev):
    """Return the text block of device `dev` from the event string `lst`.

    Assumes you meant dev rather than plane. Returns "" if the device is absent.
    """
    try:
        start_reading = lst.index("--- " + str(dev) + " ---")
    except ValueError:
        return ""   # No device

    try:
        # Stop just before the newline that precedes the next device's marker
        stop_reading = lst.index("--- " + str(dev + 1) + " ---") - 1
    except ValueError:
        # No marker for dev+1: read to the end of the string
        stop_reading = len(lst)

    return lst[start_reading:stop_reading]

Test

lst= """=== 169139 ===
Start: 4.80374e+19
End:   4.80374e+19
--- 1 ---
Pix 9, 66
--- 2 ---
Pix 11, 31
Pix 12, 31
--- 3 ---
Pix 17, 53
Pix 16, 53
Pix 16, 54
--- 4 ---
Pix 44, 64
--- 5 ---
Pix 49, 133
Pix 48, 133
--- 6 ---
Pix 109, 143
Pix 108, 143
Pix 108, 144 
Pix 109, 144"""

# Retrieve and print data for each device
print('----------------Individual Device String Info-------------')
for dev in range(7):
  print(f'device {dev}\n{start_stop_dev(lst, dev)}')

print('----------------Splits of String Info----------------------')
for dev in range(7):
  dev_lst = start_stop_dev(lst,dev).split("\n")
  print(f'dev {dev}: {dev_lst}')

Output

----------------Individual Device String Info-------------
device 0

device 1
--- 1 ---
Pix 9, 66
device 2
--- 2 ---
Pix 11, 31
Pix 12, 31
device 3
--- 3 ---
Pix 17, 53
Pix 16, 53
Pix 16, 54
device 4
--- 4 ---
Pix 44, 64
device 5
--- 5 ---
Pix 49, 133
Pix 48, 133
device 6
--- 6 ---
Pix 109, 143
Pix 108, 143
Pix 108, 144 
Pix 109, 144
----------------Splits of String Info----------------------
dev 0: ['']
dev 1: ['--- 1 ---', 'Pix 9, 66']
dev 2: ['--- 2 ---', 'Pix 11, 31', 'Pix 12, 31']
dev 3: ['--- 3 ---', 'Pix 17, 53', 'Pix 16, 53', 'Pix 16, 54']
dev 4: ['--- 4 ---', 'Pix 44, 64']
dev 5: ['--- 5 ---', 'Pix 49, 133', 'Pix 48, 133']
dev 6: ['--- 6 ---', 'Pix 109, 143', 'Pix 108, 143', 'Pix 108, 144 ', 'Pix 109, 144']
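
One caveat, as far as I can tell from the code: start_stop_dev reads to the end of the string whenever the marker for dev+1 is missing, even if later devices or another event follow. A hedged sketch of a variant that stops at the next marker of any kind, using a regular expression. The name device_block is hypothetical, and the pattern assumes marker lines have no trailing whitespace:

```python
import re


def device_block(text, dev):
    """Return the marker line and data lines for `dev`, or "" if absent.

    Hypothetical variant: stops at the next '--- ' or '=== ' line,
    or at the end of the string, whichever comes first.
    """
    pattern = rf"^--- {dev} ---$.*?(?=^(?:---|===) |\Z)"
    match = re.search(pattern, text, flags=re.MULTILINE | re.DOTALL)
    return match.group(0).rstrip("\n") if match else ""


text = """=== 169139 ===
Start: 4.80374e+19
--- 3 ---
Pix 17, 53
Pix 16, 53
--- 5 ---
Pix 49, 133
=== 169140 ===
Start: 4.80374e+19"""

print(repr(device_block(text, 3)))  # '--- 3 ---\nPix 17, 53\nPix 16, 53'
print(repr(device_block(text, 4)))  # ''
print(repr(device_block(text, 5)))  # '--- 5 ---\nPix 49, 133'
```

Here device 4 is absent, yet device 3 still ends at the marker of device 5, and device 5 ends at the next event header.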

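【Comments】:

Hi! Thanks for your answer! I noticed that I need lst[start_reading:stop_reading-1] to get rid of the extra empty line. The reason I get this is that I use start_stop_dev(lst,1).split('\n') to get a list, in order to further process the values inside it. Of course, this does not work for cases with no entries, such as 0 here. Any idea how to overcome this?
@nyw: check the code update. lst[start_reading:stop_reading-1] is not quite right, because it does not work on the last device (it drops the 4 in 144), but the corrected code should handle this. I show the results of lst[start_reading:stop_reading] and lst[start_reading:stop_reading].split('\n'). I also enumerated devices 0 through 6 (range(7)) to show the output for a missing device.
It works great now! I am now trying different test cases. Thanks a lot!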
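
On the empty-device case raised in the comments: "".split("\n") returns [''], whereas "".splitlines() returns [], which is usually easier to process further. A minimal sketch that turns a device block into integer pairs; the name pix_coords and the (x, y) tuple layout are assumptions for illustration only:

```python
def pix_coords(block):
    """Convert the 'Pix x, y' lines of a device block into (x, y) tuples.

    Hypothetical helper: marker lines and blank lines are skipped.
    """
    coords = []
    for line in block.splitlines():       # "" gives [], not ['']
        line = line.strip()               # drops trailing spaces such as 'Pix 108, 144 '
        if not line.startswith("Pix"):
            continue
        x, y = line[len("Pix"):].split(",")
        coords.append((int(x), int(y)))   # int() tolerates surrounding spaces
    return coords


print(pix_coords("--- 6 ---\nPix 109, 143\nPix 108, 144 "))  # [(109, 143), (108, 144)]
print(pix_coords(""))                                         # []
```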