【Posted】: 2017-05-06 09:56:42
【Question】:
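The goal is to handle the tokenization step of an NLP pipeline by porting a Perl script to Python.
The main problem is that spurious backslashes appear in the output when we run the Python port of the tokenizer.
In Perl, we may need to escape the single quote and the ampersand like this: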
my($text) = @_; # Take the text from the subroutine's arguments
$text =~ s=n't = n't =g; # Puts a space before the "n't" substring to tokenize english contractions like "don't" -> "do n't".
$text =~ s/\'/\&apos;/g; # Escape the single quote so that it suits XML.
Porting the regexes literally into Python:
>>> import re
>>> from six import text_type
>>> sent = text_type("this ain't funny")
>>> escape_singquote = r"\'", r"\&apos;" # escape the left quote for XML
>>> contraction = r"n't", r" n't" # pad a space on the left when "n't" pattern is seen
>>> text = sent
>>> for regexp, substitution in [contraction, escape_singquote]:
...     text = re.sub(regexp, substitution, text)
...     print(text)
...
this ai n't funny
this ai n\&apos;t funny
The escape in front of the ampersand somehow ends up in the output as a literal backslash =(
To work around this, I can do the following:
>>> escape_singquote = r"\'", r"&apos;" # escape the left quote for XML
>>> text = sent
>>> for regexp, substitution in [contraction, escape_singquote]:
...     text = re.sub(regexp, substitution, text)
...     print(text)
...
this ai n't funny
this ai n&apos;t funny
But even without escaping the single quote in the Python pattern, we still get the desired result:
>>> import re
>>> from six import text_type
>>> sent = text_type("this ain't funny")
>>> escape_singquote = r"\'", r"\&apos;" # escape the left quote for XML
>>> contraction = r"n't", r" n't" # pad a space on the left when "n't" pattern is seen
>>> escape_singquote = r"'", r"&apos;" # escape the left quote for XML
>>> text = sent
>>> for regexp, substitution in [contraction, escape_singquote]:
...     text = re.sub(regexp, substitution, text)
...     print(text)
...
this ai n't funny
this ai n&apos;t funny
Now that is puzzling...
Given the context above, the question is: which characters do we need to escape in Python, and which in Perl? Aren't regexes in Perl and Python equivalent?
【Comments】:
-
You are using raw strings everywhere. The backslash is literal.
-
The Perl version doesn't need the backslashes either.
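The two comments above can be checked with a minimal sketch. It assumes Python 3 and uses only the standard `re` module; the variable names (`with_backslash`, `without_backslash`, `via_callable`) are illustrative and do not come from the original post. As far as the `re` documentation describes it, `re.sub` leaves unknown non-letter escapes such as `\&` in the replacement string untouched, so the backslash survives into the output. Perl, by contrast, treats the replacement like a double-quoted string, where `\&` is simply `&`.

```python
import re

sent = "this ain't funny"

# Pattern side: an escaped and an unescaped single quote match the same thing.
assert re.findall(r"\'", sent) == re.findall(r"'", sent) == ["'"]

# Replacement side: "\&" is not a recognised escape in the replacement
# template, so re.sub keeps the backslash as a literal character.
with_backslash = re.sub(r"'", r"\&apos;", sent)

# Dropping the backslash gives the XML entity that was intended.
without_backslash = re.sub(r"'", r"&apos;", sent)

# A callable replacement skips template processing entirely, which
# avoids having to reason about replacement escapes at all.
via_callable = re.sub(r"'", lambda match: "&apos;", sent)

print(with_backslash)     # this ain\&apos;t funny
print(without_backslash)  # this ain&apos;t funny
```

For a fixed one-character substitution like this one, `sent.replace("'", "&apos;")` would probably be the simpler choice, since no regex semantics are involved.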
Tags: python regex perl escaping tokenize