【Question title】: python3 extract string between two strings in a txt file
【Posted】: 2017-12-28 18:14:39
【Question】:

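I am new to Python. I am trying to extract a string ("concluded that our disclosure controls were effective as of") from a txt file ("infile.txt"). The file is rather large, and I need to look for that string within one specific section (between "ITEM&nbsp;9A" and "ITEM&nbsp;9B"). An example of such a section is below: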

</A>ITEM&nbsp;9A. CONTROLS AND PROCEDURES. </B></FONT></P> <P STYLE="margin-top:6px;margin-bottom:0px"><FONT STYLE="font-family:Times New Roman" SIZE="2"><B>Evaluation of Disclosure Controls and Procedures </B></FONT> STYLE="margin-top:6px;margin-bottom:0px; text-indent:4%"><FONT STYLE="font-family:Times New Roman" SIZE="2">Under the supervision and with the participation of our management, including our Chief Executive Officer and Chief Financial Officer, we conducted an evaluation of the effectiveness of our disclosure controls and procedures (as defined in Rules 13a-15(e) and 15d-15(e) under the Securities Exchange Act of 1934, as amended (Exchange Act)), as of the end of the period covered by this Annual Report on Form 10-K. Management recognizes that any controls and procedures, no matter how well designed and operated, can provide only reasonable assurance of achieving their objectives and management necessarily applies its judgment in evaluating the cost-benefit relationship of possible controls and procedures. Based on such evaluation, our Chief Executive Officer and Chief Financial Officer concluded that our disclosure controls and procedures were effective as of September&nbsp;28, 2012. </FONT></P> <P STYLE="margin-top:18px;margin-bottom:0px"><FONT STYLE="font-family:Times New Roman" SIZE="2"><B>Management&#146;s Annual Report on Internal Control over Financial Reporting </B></FONT> <P STYLE="margin-top:6px;margin-bottom:0px; text-indent:4%"><FONT STYLE="font-family:Times New Roman" SIZE="2">This Annual Report does not include a report of management&#146;s assessment regarding internal control over financial reporting or an attestation report of the company&#146;s registered public accounting firm due to a transition period established by rules of the Securities and Exchange Commission for newly public companies. 
</FONT> <P STYLE="margin-top:18px;margin-bottom:0px"><FONT STYLE="font-family:Times New Roman" SIZE="2"><B>Changes in Internal Control over Financial Reporting </B></FONT></P> <P STYLE="margin-top:6px;margin-bottom:0px; text-indent:4%"><FONT STYLE="font-family:Times New Roman" SIZE="2">There were no changes in our internal control over financial reporting (as defined in Rule&nbsp;13a-15(f) under the Exchange Act) during the quarter ended September&nbsp;28, 2012, that have materially affected, or are reasonably likely to materially affect, our internal control over financial reporting. </FONT> <P STYLE="margin-top:18px;margin-bottom:0px"><FONT STYLE="font-family:Times New Roman" SIZE="2"><B><A NAME="tx431171_16"></A>ITEM&nbsp;9B. OTHER INFORMATION.

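If the section contains the desired string "concluded that our disclosure controls were effective as of" (roughly in the middle of the section above), I would like to write a "1" to a separate "output.csv" file; if not, write "Not found". The start of the section does not always coincide with the start of a line. Sorry, but I have no idea how to get started... I am using Python 3.6.

Thank you very much!

【Comments】:

    Tags: python html regex python-3.x parsing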


    【Solution 1】:

    You can use re.findall:

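    import re

    # Read the whole filing; "infile.txt" is the file named in the question.
    with open("infile.txt") as f:
        string_data_from_file = f.read()

    # Capture the Item 9A heading text up to the closing </B> tag.
    the_data = re.findall(r"</A>ITEM&nbsp;9A\. (.*?)</B>", string_data_from_file)

    if the_data:
        print("1")
    else:
        print("Not found")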
    

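    【Comments】:

    • Thanks, Ajax1234. It still does not seem to work your way. I don't understand why you use re.findall("ITEM 9A. (.*?)"). Shouldn't it be re.findall("ITEM 9A. (.*?)ITEM 9B.")? Even then, it does not find the desired string between the two sections... By the way, in place of your "string_data_from_file" I inserted my desired string, "concluded that our disclosure controls were effective". I hope that is OK. Thanks again.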
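One detail matters for the opener/closer pattern suggested in the comment above: by default `.` does not match a newline, so a section that spans several lines is only captured when `re.DOTALL` is passed. The snippet below is a minimal sketch on a hypothetical three-line sample (not the real filing) that contrasts the two behaviours.

```python
import re

# Hypothetical miniature of the filing: the section spans several lines.
sample = (
    "ITEM&nbsp;9A. CONTROLS AND PROCEDURES.\n"
    "were effective as of\n"
    "ITEM&nbsp;9B. OTHER INFORMATION."
)

# Opener and closer with the literal dots escaped.
pattern = r"ITEM&nbsp;9A\.(.*?)ITEM&nbsp;9B\."

# Without re.DOTALL the "." cannot cross the line breaks.
without_dotall = re.findall(pattern, sample)

# With re.DOTALL the "." also matches "\n", so the section is captured.
with_dotall = re.findall(pattern, sample, re.DOTALL)

print(len(without_dotall), len(with_dotall))
```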
    【Solution 2】:

    You can use regular expressions to extract the text between a given opener and closer:

    import re
    
    opener = re.escape(r"ITEM&nbsp;9A")
    closer = re.escape(r"ITEM&nbsp;9B")
    

    You can use re.finditer to loop over the extracts, then use the in-operator to keep only the extracts that contain the target string:

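    # inputstring is expected to hold the full text of the file
    target_string = "concluded that our disclosure controls were effective as of"
    # re.DOTALL lets "." match newlines, so a section may span several lines
    for mo in re.finditer(opener + '(.*?)' + closer, inputstring, re.DOTALL):
        extract = mo.group(1)  # text between the opener and the closer
        if target_string in extract:
            ...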
    

    Hope this is enough to get you started :-)
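To connect the extraction loop with the "output.csv" requirement from the question, here is a minimal end-to-end sketch. The function names (`section_has_target`, `write_result`, `check_file`), the single-cell CSV layout, and the UTF-8 encoding with replacement of undecodable bytes are assumptions made for illustration; they are not part of the original answer.

```python
import csv
import re

# Section delimiters from the question, escaped so they match literally.
OPENER = re.escape("ITEM&nbsp;9A")
CLOSER = re.escape("ITEM&nbsp;9B")
SECTION_RE = re.compile(OPENER + "(.*?)" + CLOSER, re.DOTALL)


def section_has_target(text, target):
    """Return True if any 9A..9B section of text contains target."""
    return any(target in mo.group(1) for mo in SECTION_RE.finditer(text))


def write_result(found, out_path="output.csv"):
    """Write "1" or "Not found" as a single CSV cell (assumed layout)."""
    with open(out_path, "w", newline="") as f:
        csv.writer(f).writerow(["1" if found else "Not found"])


def check_file(in_path, target, out_path="output.csv"):
    """Read in_path, test its 9A..9B sections, and record the result."""
    # The encoding is an assumption; undecodable bytes are replaced.
    with open(in_path, encoding="utf-8", errors="replace") as f:
        found = section_has_target(f.read(), target)
    write_result(found, out_path)
    return found
```

A call would look like `check_file("infile.txt", "concluded that our disclosure controls were effective as of")`. Because the check is a plain substring test, it only succeeds when the target appears character for character inside the section.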

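    【Comments】:

    • Thanks, Raymond Hettinger. It does not seem to return anything. By the way, I replaced your "s" before re.DOTALL with "target_string"; I suspect that is what you meant. After "if target_string in extract:" I added a simple "print(extract)" and it prints nothing... Any idea what is going on?
    • Raymond did not intend target_string to be the argument to finditer; he expected your input string to be stored in a variable called s. I edited the code to read inputstring to make this clearer. There were also a few superfluous spaces in the opener and closer patterns, which I removed. Even so, this code still finds no match, but only because the string you are looking for ("concluded that our disclosure controls were effective as of") does not occur in the input you provided in the question.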
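The sample in the question actually reads "concluded that our disclosure controls and procedures were effective as of", which is why an exact substring test fails. One possible way to tolerate such variation is to strip the markup, decode entities, collapse whitespace, and then search with a slightly looser pattern. The sketch below assumes that an optional "and procedures" is the only wording difference that matters and uses a deliberately naive tag-stripping regex; the helper names are hypothetical.

```python
import html
import re

# Naive tag stripper (an assumption; a real HTML parser is more robust).
TAG_RE = re.compile(r"<[^>]+>")

# Target phrase with an optional "and procedures", as seen in the sample.
TARGET_RE = re.compile(
    r"concluded that our disclosure controls(?: and procedures)? "
    r"were effective as of",
    re.IGNORECASE,
)


def normalize(fragment):
    """Strip tags, decode entities such as &nbsp;, collapse whitespace."""
    text = TAG_RE.sub(" ", fragment)
    text = html.unescape(text)
    return " ".join(text.split())


def has_conclusion(fragment):
    """Return True if the normalized fragment contains the target phrase."""
    return TARGET_RE.search(normalize(fragment)) is not None
```

`has_conclusion` can be applied to each `extract` in place of the `in` test from the answer.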