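How to apply a regex to an entire file, not just line by line?

【Question title】: How to apply regex to entire file, not just line after line?
【Posted】: 2023-03-23 20:49:01
【Question description】:

I want my regex to apply not only to the first line of a text file but to all of its lines. Currently it only matches when the entire match lies on a single line. If a match continues onto the next line, it is not matched at all.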

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Parser {
  public static void main(String[] args) throws IOException {

    Pattern patt = Pattern.compile("(include|"
            + "integrate|"
            + "driven based on|"
            + "facilitate through|"
            + "contain|"
            + "using|"
            + "equipped|"
            + "implement|"
            + "utilized to facilitate|"
            + "comprise){1}"
            + "[\\s\\w\\,\\(\\)\\;\\:]*\\.");  // Regex

    // Open the input file and the output file (append mode); both are closed automatically.
    try (BufferedReader r = new BufferedReader(new FileReader("E:/test/test.txt"));
         PrintWriter pWriter = new PrintWriter(
                 new BufferedWriter(new FileWriter("E:/test/test1.txt", true)))) {

      String line;
      while ((line = r.readLine()) != null) {
        Matcher matcher = patt.matcher(line);  // the matcher only ever sees a single line
        while (matcher.find()) {
          pWriter.println(matcher.group());    // write the match to the new file
          System.out.println(matcher.group());
        }
      }
    }
  }
}

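【Question comments】:

Can you provide test data? Are there newlines between the expressions you are trying to match?
@Razib: Java has no "global modifier", and it does not need one. But even in languages that do use it (such as JavaScript or Perl), it has nothing to do with this problem.

Tags: java regex matcher

【Solution 1】:

Change the while ((line = r.readLine()) != null) loop to: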

StringBuilder file = new StringBuilder(); // Accumulates all of the lines in the file
while ((line = r.readLine()) != null) {
    file.append(line); // readLine() drops the line terminator, so lines are joined directly
}
Matcher matcher = patt.matcher(file); // Match once against the whole content
while (matcher.find()) {
    /* Your output code goes here. */
}

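This happens because you run the regex against each line separately. For what you want, you need to apply the regex once to the content of the whole file.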
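As a rough sketch of the same idea, assuming Java 11 or later, the file can also be read in a single call with Files.readString. Unlike readLine(), this keeps the line terminators, and because \s in the character class matches \r and \n, a match can continue onto the next line. The class name WholeFileMatcher, the helper findAll, and the shortened keyword list are made up for illustration and are not part of the original code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WholeFileMatcher {
    // Shortened, hypothetical keyword list; the character class follows the question's regex.
    static final Pattern PATT =
            Pattern.compile("(include|contain|comprise)[\\s\\w,();:]*\\.");

    // Returns every match found in the given text, in order of appearance.
    static List<String> findAll(String text) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = PATT.matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) throws IOException {
        // Pass the input file as the first argument (for example E:/test/test.txt);
        // without an argument a small inline sample is used instead.
        String text = args.length > 0
                ? Files.readString(Path.of(args[0]))
                : "Devices include a sensor,\nand a battery.";
        findAll(text).forEach(System.out::println);
    }
}
```

Files.readString decodes the file as UTF-8 by default, so a file in another encoding may need the overload that takes a Charset.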

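【Comments】:

Thanks!! Your answer helped me.
No problem, glad to help! :) If this solved your problem, I suggest marking it as the answer so that others can see it quickly too.
The test data for this regex is a bit odd. When I use text I typed myself, the regex finds everything I need. But when I use text from a PDF file converted to txt, the regex only finds the first match. I think it can only read a limited number of characters.
@OlegNekhayenko Well, I told you at least three times before he gave his answer... (I deleted 2 comments). It's always the same...
@maraca: Your answer removes any newline characters (\n) from the string, but because of how readLine() works they are not there in the first place. Take your example "com\nprise": the two calls to readLine() return "com" and "prise", but the OP wants "comprise", which is found if you join all the strings together as in my answer.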

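【Solution 2】:

Currently the matcher is applied line by line; it needs to be applied to the whole file to work as expected.

The quantifier is greedy: after a keyword, the character class keeps consuming whitespace and word characters until it reaches a . or a character outside the class, so a single match can span a lot of text: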

...
        + "comprise){1}"
        + "[\\s\\w\\,\\(\\)\\;\\:]*\\.");  //Regex

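In the last line you match any whitespace and word characters, so almost anything except the . itself. The {1} is redundant, and so are most of the backslashes, because those characters need no escaping inside []: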

...
        + "comprise)"
        + "[\\s\\w,();:]*\\.");  //Regex
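One way to check that the simplification changes nothing is to run both patterns over the same text and compare the results. The sketch below uses a shortened keyword list and a made-up sample sentence; the class name RegexEquivalence and the helper findAll are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexEquivalence {
    // Original form: explicit {1} and escaped punctuation inside the character class.
    static final Pattern ORIGINAL =
            Pattern.compile("(using|comprise){1}[\\s\\w\\,\\(\\)\\;\\:]*\\.");
    // Simplified form: no {1}, no unnecessary backslashes inside [].
    static final Pattern SIMPLIFIED =
            Pattern.compile("(using|comprise)[\\s\\w,();:]*\\.");

    // Collects every match of the pattern in the text, in order of appearance.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical sample text exercising every punctuation mark in the class.
        String sample = "Systems comprise a pump (small); a valve: and a tank. "
                + "Built using steel, and glass.";
        // Prints true when both patterns find exactly the same matches.
        System.out.println(findAll(ORIGINAL, sample).equals(findAll(SIMPLIFIED, sample)));
    }
}
```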

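If you don't care about the line breaks, remove them first and it should work (if you have something like "com\nprise" and want to match it, I don't see any other way around it):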

s = s.replaceAll("\\n+", "");
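Put together, the approach might look like the sketch below. It assumes the text may also contain \r (for example in files produced on Windows), so it strips both \r and \n rather than only \n. The class name StripLineBreaks, the helper names, the shortened keyword list, and the sample text are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StripLineBreaks {
    // Shortened, hypothetical keyword list with the simplified character class.
    static final Pattern PATT =
            Pattern.compile("(include|contain|comprise)[\\s\\w,();:]*\\.");

    // Removes every run of \r and \n so that words split across lines are rejoined.
    static String stripLineBreaks(String s) {
        return s.replaceAll("[\\r\\n]+", "");
    }

    // Collects every match in the text, in order of appearance.
    static List<String> findAll(String text) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = PATT.matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // "comprise" is split across two lines in this made-up sample.
        String s = "The kits com\r\nprise a lens,\na mount. Done.";
        findAll(stripLineBreaks(s)).forEach(System.out::println);
    }
}
```

Note the trade-off: deleting line breaks also glues together words that were separated only by a line break ("lens,\na" becomes "lens,a"). Replacing them with a space would avoid that, but would no longer rejoin a keyword such as "com\nprise".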

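【Comments】:

Where in the code should I insert s = s.replaceAll("\\n+", ""); ?
After reading the data: read the whole file, then do the replacement on the resulting string, then apply the matcher. Since you have removed all the newlines, you may want to add a \n after each match in the output file.
@OlegNekhayenko Never mind the remark about \n; you are using PrintWriter, whose println adds the line break automatically... the rest is correct.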