【Question title】: How to build the correct regex to find words in a txt file using Python?
【Posted】: 2024-04-26 00:30:01
【Question description】:

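I have a txt file, and I want to search it for specific words and save them to another txt file together with the number of times each one appears. For example, I want to search for "jardim guanabara", "jd guanabara", "jd gb", "norte", "zona norte", "vale dos sonhos", "asa branca", and "joao paulo".

This is what I have tried so far, but I am not sure how to proceed. Could you help me write the correct regex to find these words? Thanks for your help.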

import re

regex = r"((?<=zona )norte\w+|(?<=jardim )guanabara|(?<=jardim )gb\w+)|((?<=joao )paulo\w+|(?<=zn)norte|(?<=gato)dorm\w+)"


with open('file.txt', 'r') as f:
    #input_file = f.readlines()

    for line in f:
        x = re.search(regex, line)
        print(x)

I would like output like this to be saved to another txt file.

【Comments】:

    Tags: regex python-3.x search


    【Solution 1】:

    I'm guessing you might want to design an expression similar to the following:

    ^(?=.*(?:\bjardim\s+guanabara\b|\bjd\s+guanabara\b|\bjd\s+gb\b|\bnorte\b|\bzona\s+norte\b|\bvale\s+dos\b\s+sonhos\b|\basa\s+branca\b|\bjoao\s+paulo\b)).*$
    

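    The expression is explained in the top-right panel of regex101.com, if you wish to explore, simplify, or modify it, and in this link you can watch how it matches against some sample inputs.

    Test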

    import re
    
    regex = r"^(?=.*(?:\bjardim\s+guanabara\b|\bjd\s+guanabara\b|\bjd\s+gb\b|\bnorte\b|\bzona\s+norte\b|\bvale\s+dos\b\s+sonhos\b|\basa\s+branca\b|\bjoao\s+paulo\b)).*$"
    
    test_str = """
    I want to search for this words jardim guanabara.
    I want to search for this words jd guanabara.
    I want to search for this words jd gb.
    I want to search for this words norte.
    I want to search for this words zona norte.
    I want to search for this words vale dos sonhos.
    I want to search for this words asa branca and joao paulo.
    
    I don't want to search for this words nojardim guanabara.
    I don't want to search for this words nojd guanabara.
    I don't want to search for this words nojd gb.
    I don't want to search for this words nonorte.
    I don't want to search for this words nozona norte.
    I don't want to search for this words novale dos sonhos.
    I don't want to search for this words noasa branca and joao paulo.
    """
    
    print(re.findall(regex, test_str, re.M))
    

    Output

    ['I want to search for this words jardim guanabara.', 'I want to search for this words jd guanabara.', 'I want to search for this words jd gb.', 'I want to search for this words norte.', 'I want to search for this words zona norte.', 'I want to search for this words vale dos sonhos.', 'I want to search for this words asa branca and joao paulo.', "I don't want to search for this words nozona norte.", "I don't want to search for this words noasa branca and joao paulo."]
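
Two of the "don't want" lines still appear in the output because `norte` and `joao paulo` occur in them as whole words. If per-term counts are wanted rather than whole matching lines, one possible approach is a single alternation combined with `re.finditer` and `collections.Counter`. This is only a sketch and not part of the original answer: the helper name `count_terms` and the sample text are hypothetical, and the longer phrases are listed first on the assumption that `zona norte` should not also be counted as a bare `norte`.

```python
import re
from collections import Counter

# Longer phrases come first so the alternation prefers "zona norte" over "norte".
TERMS_PATTERN = re.compile(
    r"\b(?:jardim\s+guanabara|jd\s+guanabara|jd\s+gb|zona\s+norte|"
    r"vale\s+dos\s+sonhos|asa\s+branca|joao\s+paulo|norte)\b"
)


def count_terms(text):
    """Count whole-word occurrences of each search term (hypothetical helper)."""
    counts = Counter()
    for match in TERMS_PATTERN.finditer(text):
        # Collapse runs of whitespace so "jd   gb" and "jd gb" share one key.
        counts[" ".join(match.group(0).split())] += 1
    return counts


# Hypothetical sample text, for illustration only.
sample = "zona norte, norte, jd  gb, nonorte and joao paulo"
print(count_terms(sample))
```

If a bare `norte` inside `zona norte` should be counted as well, counting each term with its own pattern would be the simpler design.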
    

    RegEx Circuit

    jex.im visualizes regular expressions:

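    【Comments】:

    • Thank you very much @Emma. It is much clearer to me now how to build the regex. I have this code: with open('file.txt', 'r') as file: for line in file: for match in re.findall(regex, line): #finditer print(match) How can I save the results to another txt file, @Emma? Thanks again for the explanation, it was very clear :)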
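
One possible answer to the follow-up above is to open a second file for writing and write each match as it is found. The sketch below makes several assumptions: `save_matches` is a hypothetical helper, the file names are placeholders, and the pattern is shortened for brevity (the full expression can be substituted).

```python
import re

# Shortened pattern for illustration; substitute the full expression as needed.
PATTERN = re.compile(r"\b(?:asa\s+branca|joao\s+paulo|zona\s+norte)\b")


def save_matches(in_path, out_path):
    """Write every match found in in_path to out_path, one per line.

    Returns the number of matches written. Hypothetical helper.
    """
    written = 0
    with open(in_path, "r", encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            # The group is non-capturing, so findall returns the full matches.
            for match in PATTERN.findall(line):
                dst.write(match + "\n")
                written += 1
    return written
```

It could be called as `save_matches("file.txt", "matches.txt")`, where both file names are placeholders.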
    【Solution 2】:

    Here is one way to do this (assuming your .txt file is named in.txt):

    search_terms = [
        "asa branca",
        "joao paulo",
    ]
    
    with open("in.txt") as f:
        text = f.read()
    
        occurence_map = {term: text.count(term) for term in search_terms}
    

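    This uses a "dictionary comprehension", a feature available in Python 2.7 and later, and in Python 3.0 and later. Basically, it builds a dictionary: for each term we want to search for, it uses the term as the key and the number of times the term occurs in the text as the value.

    It is slightly less concise, but you can do the same thing in a more straightforward way, like this: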

    with open("in.txt") as f:
        text = f.read()
    
        occurence_map = dict()
    
        for term in search_terms:
            occurence_map[term] = text.count(term)
    

    You can then write it to a file in whatever format you like. For example:

    with open("out.txt", "w") as f:
        for term, count in occurence_map.items():
            f.write("{}: {}\n".format(term, count))
    

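    Note: this solution only works if you want exact string matches and do not need them to be delimited by word boundaries. That is, searching for foo bar will match the following: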

    • Somethingfoo barsomething.
    • Something foo bar something.

    ...and these will not:

    • Something foo  bar. (multiple spaces, which do not render)
    • foo\tbar
    • Foo bar.
    • foo Bar.

    If that is needed, a regular expression is the better choice. If that is the case, I can edit my answer.
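
The limitations listed in the note could be handled with a regex along the following lines. This is a minimal sketch rather than the answerer's own edit: `count_term` is a hypothetical helper, and it assumes every search term starts and ends with a word character, so that `\b` behaves as expected.

```python
import re


def count_term(text, term):
    """Count whole-word, case-insensitive occurrences of term in text.

    Any run of whitespace in the text (spaces, tabs) is accepted wherever the
    term contains a space. Hypothetical helper, not from the original answer.
    """
    words = [re.escape(word) for word in term.split()]
    pattern = r"\b" + r"\s+".join(words) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


# The examples from the note above.
samples = [
    "Somethingfoo barsomething.",
    "Something foo bar something.",
    "Something foo   bar.",
    "foo\tbar",
    "Foo bar.",
    "foo Bar.",
]
for sample in samples:
    print(repr(sample), count_term(sample, "foo bar"))
```

With this pattern the first sample, where the term is glued to other letters, is no longer counted, while the remaining five each count once.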

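    【Comments】:

    • Thanks for your answer. I tried it the way you showed and it works; however, I have to search for strings that are written in different ways. Take a look: Asa Branca:1 João Paulo:43 João Paulo 2:4 João Paulo II:12 Vera Cruz:14 vera cruz:1 vale dos sonhos:20 Vale dos Sonhosregião norte:0 norte:3 jardim Guanabara:13 jd. guanabara: 0 Jardim Guanabara: 17 Jardim Guanabara 1: 0 Jardim Guanabara 2: 0 Jardim Guanabara 2: 0 Jardim Guanabara 3: 1 Jardim Guanabara 3: 1 guanabara: 30 I think this might be easier with regex, but I am very new to it.
    • You can still do this without regex, although it becomes less intuitive/explicit. You could first call .lower() on text to make it all lowercase, and then manually replace the non-ASCII characters (or use a library such as unicodedata). You may want to take a look at this post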
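
The normalization suggested in the last comment might look like the sketch below. It assumes that stripping combining accents after Unicode decomposition is acceptable for these place names; `fold` is a hypothetical helper name.

```python
import unicodedata


def fold(text):
    """Lowercase text and strip combining accents (hypothetical helper)."""
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


print(fold("João Paulo II"))
print(fold("Jardim Guanabara 2").count("jardim guanabara"))
```

Applying `fold` to the file contents before counting lets variants such as "João Paulo" and "joao paulo" count toward the same lowercase, unaccented term.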