Answers: Regex translation from Perl to Python

【Question title】: Regex translation from Perl to Python
【Posted】: 2026-01-03 09:45:01
【Question description】:

I want to rewrite a small Perl program in Python. I use it to process text files like the following:

Input:

00000001;Root;;
00000002;  Documents;;
00000003;    oracle-advanced_plsql.zip;file;
00000004;  Public;;
00000005;  backup;;
00000006;    20110323-JM-F.7z.001;file;
00000007;    20110426-JM-F.7z.001;file;
00000008;    20110603-JM-F.7z.001;file;
00000009;    20110701-JM-F-via-summer_school;;
00000010;      20110701-JM-F-yyy.7z.001;file;

Expected output:

00000001;;Root;;
00000002;  ;Documents;;
00000003;    ;oracle-advanced_plsql.zip;file;
00000004;  ;Public;;
00000005;  ;backup;;
00000006;    ;20110323-JM-F.7z.001;file;
00000007;    ;20110426-JM-F.7z.001;file;
00000008;    ;20110603-JM-F.7z.001;file;
00000009;    ;20110701-JM-F-via-summer_school;;
00000010;      ;20110701-JM-F-yyy.7z.001;file;

Here is the working Perl code:

#!/usr/bin/perl -w
# filename: perl_regex.pl
while (<>) {
  s/^(.*?;.*?)(\w)/$1;$2/;  # insert ';' before the first word character after the first ';'
  print $_;
}

It is called from the command line: perl_regex.pl input.txt

Explanation of the Perl-style regex:

s/        # start search-and-replace regexp
  ^       # start at the beginning of this line
  (       # save the matched characters until ')' in $1
    .*?;  # go forward until finding the first semicolon
    .*?   # go forward until finding... (to be continued below)
  )
  (       # save the matched characters until ')' in $2
    \w    # ... the next word character (letter, digit, or underscore).
  )
/         # continue with the replace part
  $1;$2   # write all characters found above, but insert a ; before $2
/         # finish the search-and-replace regexp.

Can anyone tell me how to get the same result in Python? In particular, I cannot find anything equivalent to the $1 and $2 variables.

【Question discussion】:

Tags: python regex perl migration

【Solution 1】:

The Python counterpart of Perl's s/pattern/replace/ substitution is the function re.sub(pattern, replace, string), or equivalently re.compile(pattern).sub(replace, string). In your case, you would write:

import re

_re_pattern = re.compile(r"^(.*?;.*?)(\w)")
result = _re_pattern.sub(r"\1;\2", line)

Note that $1 becomes \1. As in Perl, you still need to iterate over your lines in whatever way suits you (open, an input file, splitting lines, ...).
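As a minimal sketch of how the two lines above might be wired into a loop, the example below applies the compiled pattern to a small in-memory list of lines taken from the question. The helper name convert_lines and the sample list are made up for illustration and are not part of the original answer.

```python
import re

# Compile once, outside the loop, so the same pattern object is reused.
_re_pattern = re.compile(r"^(.*?;.*?)(\w)")


def convert_lines(lines):
    """Yield each line with a ';' inserted before the first word character
    that follows the first ';'."""
    for line in lines:
        yield _re_pattern.sub(r"\1;\2", line)


# Two sample lines from the question; each keeps its trailing newline.
sample = [
    "00000001;Root;;\n",
    "00000002;  Documents;;\n",
]

converted = list(convert_lines(sample))
for line in converted:
    print(line, end="")  # the line already ends with a newline
```

To process real files in the way Perl's while(<>) does, the iterable passed to convert_lines could be an open file object or fileinput.input() from the standard library.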

【Comments】:

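I have seen this several times: what is this compiling of the regex about? Does it just store the string in a variable, or does it optimize the regex in some way?
@royskatt Before a regular expression is executed, it is parsed, optimized, and converted into efficient instructions. If you do not compile it manually in Python, it will be compiled on every match, which is unnecessary overhead. Perl's regex literals mostly hide this.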
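I do not know regular expressions that well, but I believe this turns the pattern into an automaton. That is the step that is performed the first time you call re.sub instead of compiled_reg.sub(); the result is then stored in a cache so that subsequent calls are faster. I have gotten into the habit of compiling my regexes, and I suppose that saves the cache lookup on each iteration.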
@Cilyan, according to the note in the re.compile documentation, only patterns passed to re.search, re.match, and re.compile are cached.
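@falsetru, your comment surprised me, so I looked into it a little, and the cache actually applies to sub as well, at least since 2.6 (I did not check earlier versions). I guess that makes sense ;)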
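To make the discussion above concrete, here is a small sketch showing that the module-level re.sub and the sub method of a precompiled pattern produce the same result. The difference is only where the compilation (or cache lookup) happens. The sample lines come from the question, the variable names are illustrative, and nothing here measures performance.

```python
import re

PATTERN = r"^(.*?;.*?)(\w)"
lines = [
    "00000004;  Public;;\n",
    "00000006;    20110323-JM-F.7z.001;file;\n",
]

# Module-level call: the pattern string is handed to re on every call.
via_module = [re.sub(PATTERN, r"\1;\2", line) for line in lines]

# Precompiled pattern: compile once, then reuse the pattern object directly.
compiled = re.compile(PATTERN)
via_compiled = [compiled.sub(r"\1;\2", line) for line in lines]

print(via_module == via_compiled)
```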

【Solution 2】:

Python regular expressions are very similar to Perl's, except that:

There are no regex literals in Python. Patterns are written as strings. I used an r'raw string literal' in the code below.
Backreferences are written as \1, \2, .. or \g<1>, \g<2>, ..
...

Use re.sub for the substitution.

import re
import sys

for line in sys.stdin: # Explicitly iterate standard input line by line
    # `line` contains trailing newline!
    line = re.sub(r'^(.*?;.*?)(\w)', r'\1;\2', line)
    # print(line) would add a second newline, because `line` already ends with one
    sys.stdout.write(line) # Print the replaced string back.
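The two backreference spellings listed above can be compared side by side. The following is a small sketch, assuming one sample line from the question; the variable names are illustrative. It also shows the case where the \g<1> form is needed, namely when a literal digit follows the reference.

```python
import re

line = "00000005;  backup;;\n"
pattern = r"^(.*?;.*?)(\w)"

# Both spellings refer to the same capture groups.
numbered = re.sub(pattern, r"\1;\2", line)
explicit = re.sub(pattern, r"\g<1>;\g<2>", line)

# r"\10" would be read as a reference to group 10, whereas r"\g<1>0"
# means group 1 followed by a literal "0".
with_digit = re.sub(r"(a)", r"\g<1>0", "abc")

print(numbered, end="")
print(explicit, end="")
print(with_digit)
```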

【Comments】:

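Cool! This is almost as compact and readable as the Perl version. It fits my brain! Only the backslashes in \1 and \2 do not look very pythonic to me, but I am new to Python. Thanks for your answer!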
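@royskatt - The Perl regex can be written as s/;\s*\K(?=\w)/;/, using the \K (keep everything to the left) assertion and a positive lookahead, which replaces only the first occurrence. A close Python equivalent is re.sub(r'(;\s*)(?=\w)', r'\1;', line, 1), which avoids one capture and limits the substitution to a single replacement.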
@Kenosis: So there is no real PCRE in Python, right? The only thing I found is github.com/awahlig/python-pcre, but it does not look very official to me.
@royskatt - See tchrist's response to Perl Compatible Regular Expression (PCRE) in Python.
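As a quick check of the lookahead variant mentioned in the comments, the sketch below runs it next to the two-group pattern from the answers on a few sample lines from the question. It only shows that both give the same output on these particular lines; it is not a claim that the two patterns are equivalent for every input. The variable names are illustrative.

```python
import re

lines = [
    "00000001;Root;;\n",
    "00000003;    oracle-advanced_plsql.zip;file;\n",
    "00000009;    20110701-JM-F-via-summer_school;;\n",
]

# Pattern from the answers: two capture groups, anchored at the line start.
two_groups = [re.sub(r"^(.*?;.*?)(\w)", r"\1;\2", line) for line in lines]

# Variant from the comments: one capture group plus a positive lookahead,
# limited to a single replacement with count=1.
lookahead = [re.sub(r"(;\s*)(?=\w)", r"\1;", line, count=1) for line in lines]

for line in lookahead:
    print(line, end="")
```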