Extracting text between two strings in a text file using Python (answers)

[Question title]: Extract text present in between two strings in a text file using Python
[Posted]: 2020-01-29 14:12:46
[Question description]:

Suppose I have a text file with the following content (some of it added after the original answer was posted):

    Quetiapine fumarate Drug substance  This document
    Povidone    Binder  USP
    This line doesn't contain any medicine name.
    This line contains Quetiapine fumarate which shouldn't be extracted as it not present at the 
    beginning of the line.
    Dibasic calcium phosphate dihydrate Diluent USP is not present in the csv
    Lactose monohydrate Diluent USNF
    Magnesium stearate  Lubricant   USNF


    Lactose monohydrate, CI 77491   
    0.6
    Colourant
    E 172

    Some lines to break the group.
    Silicon dioxide colloidal anhydrous
    (0.004
    Gliding agent
    Ph Eur

    Adding some random lines.

    Povidone
    (0.2
    Lubricant
    Ph Eur

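I have a csv containing a list of medicine names that I want to match in the .txt file, extracting all the data present between two unique medicines (when the medicine name is at the beginning of a line). (Example medicines in the csv file are 'Quetiapine fumarate', 'Povidone', 'Magnesium stearate', 'Lactose monohydrate', etc.)

I want to iterate over each line of the text file and create groups that run from one medicine to the next.

This should only happen when the medicine name appears at the start of a line, not in the middle of one.

Expected output: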

['Quetiapine fumarate   Drug substance  This document'],
['Povidone  Binder  USP'],
['Lactose monohydrate   Diluent USNF'],
['Magnesium stearate    Lubricant   USNF'],
[Lactose monohydrate, CI 77491  
    0.6
    Colourant
    E 172],

[Povidone
    (0.2
    Lubricant
    Ph Eur]

Can someone help me do this in Python?

What I have tried so far:

with open('C:/Users/test1.txt', 'r', encoding='utf8') as file:
    data = file.read()

medicines = ('Quetiapine fumarate', 'Povidone', 'Magnesium stearate', 'Lactose monohydrate')

result = []
#with open('C:\Users\substancecopy.csv') as f:
for line in data:
    if any(line.startswith(med) for med in medicines):
        result.append(line.strip())

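I need to capture all the text from one medicine to the next, as shown in the expected output, which this code does not do.

[Question discussion]:

Tags: python regex python-3.x string pattern-matching

[Solution 1]:

You can do this without a regular expression, using str.startswith():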

medicines = ('Quetiapine fumarate', 'Povidone', 'Magnesium stearate', 'Lactose monohydrate')

result = []
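# Raw string: in a normal literal, '\U' starts a unicode escape and is a SyntaxError in Python 3.
# The path is kept from the post; the file iterated here should be the text file to scan.
with open(r'C:\Users\substancecopy.csv') as f: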
    for line in f:
        if any(line.startswith(med) for med in medicines):
            result.append(line.strip())

I'm not sure why your expected output is a list of lists that each hold a single string, but if you really need that, use result.append([line.strip()]).
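The snippet above keeps only the matching lines themselves, while the question also asks for the lines that follow a medicine name, as in the multi-line groups of the expected output. Here is a minimal sketch of one way to approach that. It assumes, which the question does not state, that a group starts at a line beginning with a medicine name and ends at the next blank line or the next medicine line. `group_by_medicine` is a hypothetical helper name. Under this rule any non-blank line directly below a medicine line is attached to it, so the first block of the sample file would not come out exactly as in the expected output.

```python
def group_by_medicine(lines, medicines):
    """Collect groups of lines, each starting at a line that begins with a medicine name.

    Assumption (not from the question): a group ends at a blank line
    or at the next line that begins with a medicine name.
    """
    prefixes = tuple(medicines)  # str.startswith accepts a tuple of prefixes
    groups = []
    current = None  # the group being built, or None when outside a group
    for raw in lines:
        line = raw.strip()
        if raw.startswith(prefixes):
            current = [line]  # a medicine line opens a new group
            groups.append(current)
        elif not line:
            current = None  # a blank line closes the open group
        elif current is not None:
            current.append(line)  # continuation of the open group
    return groups


medicines = ('Quetiapine fumarate', 'Povidone', 'Magnesium stearate', 'Lactose monohydrate')

# Inline sample taken from the second half of the text in the question.
sample = """Lactose monohydrate, CI 77491
0.6
Colourant
E 172

Some lines to break the group.
Silicon dioxide colloidal anhydrous
(0.004

Povidone
(0.2
Lubricant
Ph Eur
"""

groups = group_by_medicine(sample.splitlines(), medicines)
for group in groups:
    print(group)
```

With a real file, passing the open file object (`group_by_medicine(f, medicines)`) should work the same way, since iterating over a file yields its lines.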

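[Discussion]:

I have updated my code. test1.txt contains the text given above. Is this what you meant? Please check the edited code above.
@gimcarey, are the spaces at the start of each line present in the file as well?
No, no. There are no spaces at the start.
@gimcarey What is wrong with my code? What result does it return?
It returns an empty list. Can you check the for loop? I store the file's contents in a variable called data and iterate over it.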
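A likely cause of the empty list: file.read() returns a single string, and iterating over a string yields one character at a time, not one line at a time, so no item can ever start with a multi-character medicine name. A small sketch of the difference, using an inline string in place of the file (the sample text is taken from the question):

```python
data = "Povidone    Binder  USP\nThis line doesn't contain any medicine name.\n"
medicines = ('Quetiapine fumarate', 'Povidone')

# Iterating over the string itself yields single characters.
by_char = [item for item in data if item.startswith(medicines)]

# Splitting into lines first gives whole lines to test.
by_line = [line.strip() for line in data.splitlines() if line.startswith(medicines)]

print(by_char)  # []
print(by_line)  # ['Povidone    Binder  USP']
```

Iterating over the open file object directly (for line in file:) also yields whole lines, which is what the loop in the answer does.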