
【Question title】: Why does this regex split return more components than expected?
【Posted】: 2014-07-10 00:14:22
【Question】:

The regex below (written in Python 3) is just one piece that will be added to a larger regex that splits a URL into scheme, domain, and path. This piece extracts the path.

import re

link = "http://google.com/whatever/who/jx.html"
components = re.split(r'(?<![:/])(/.*$)', link)

It returns the following:

['http://google.com', '/whatever/who/jx.html', '']

Why does the regex return an extra element at the end of the list?

【Comments】:

Not a direct answer to your question, but don't parse URLs with regular expressions. Use urllib.parse.
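To illustrate the comment above, here is a minimal sketch of getting the same three pieces with urllib.parse.urlsplit from the standard library. The variable names scheme, domain, and path are my own choice, not something from the question.

```python
from urllib.parse import urlsplit

link = "http://google.com/whatever/who/jx.html"
parts = urlsplit(link)

# urlsplit returns a named tuple; the attributes below are part of its API.
scheme = parts.scheme   # the part before "://"
domain = parts.netloc   # host (plus port or credentials, if present)
path = parts.path       # everything from the first "/" after the host

print(scheme, domain, path)
```

No regex is involved, so there is no empty trailing element to deal with.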

Tags: python regex

【Answer 1】:

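The pattern '(?<![:/])(/.*$)' matches '/whatever/who/jx.html' in your string. Because the pattern contains a capture group, re.split keeps the captured text in the result, so your string is split into what comes before the match, the match itself, and what comes after the match. The match runs to the end of the string, so what comes after it is the empty string. You get these elements (the match is shown in square brackets):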

'http://google.com'['/whatever/who/jx.html']''

Hence the resulting list:

['http://google.com', '/whatever/who/jx.html', '']

Reference:
https://docs.python.org/2/library/stdtypes.html#str.split
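A small sketch of the behaviour described above, comparing the original pattern with a variant that has no capture group. The names with_group, without_group, and non_empty are only for illustration; filtering out empty strings is one possible workaround, not the only one.

```python
import re

link = "http://google.com/whatever/who/jx.html"

# With a capture group, re.split also returns the text the group matched.
with_group = re.split(r'(?<![:/])(/.*$)', link)

# Without the group, only the text before and after the match is returned.
without_group = re.split(r'(?<![:/])/.*$', link)

# The trailing '' is the (empty) text after the match; drop it if unwanted.
non_empty = [piece for piece in with_group if piece]

print(with_group)
print(without_group)
print(non_empty)
```

In both calls the last element is '' because the match extends to the end of the string and nothing is left after it.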

【Comments】:

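I still don't quite understand this behavior. As far as I can tell, nothing after the initial split matches the pattern /.*$
The (/.*$) group matches a literal slash, then any number of any characters, then the end-of-line anchor. Because the regex anchors the match to the end of the line, the match always extends to the end of the line, so what is left after it is an empty string.
@CommuSoft Aha, got it. You should post that as an answer to the question, since it settled it for me. Thanks a lot.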

【Answer 2】:

I think it would be better to use re.match here, with a slightly different pattern:

>>> import re
>>> link = "http://google.com/whatever/who/jx.html"
>>> re.match("(https?://.+?)(/.*$)", link).groups()
('http://google.com', '/whatever/who/jx.html')
>>>

Here is a breakdown of what the regex used above matches:

(        # The start of the first capture group
http     # http
s?       # An optional s
://      # ://
.+?      # One or more characters matched non-greedily
)        # The close of the first capture group
(        # The start of the second capture group
/        # /
.*       # Zero or more characters
$        # The end of the string
)        # The close of the second capture group
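The breakdown above can also be kept in the code itself by compiling the pattern with re.VERBOSE, which ignores whitespace and allows # comments inside the pattern. This is just a sketch; URL_PARTS, prefix, and path are names I made up for the example.

```python
import re

# Same pattern as in the answer, written out with re.VERBOSE.
URL_PARTS = re.compile(r"""
    (           # start of the first capture group
      https?    # http, followed by an optional s
      ://       # the literal ://
      .+?       # one or more characters, matched non-greedily
    )           # end of the first capture group
    (           # start of the second capture group
      /         # a literal /
      .*        # zero or more characters
      $         # the end of the string
    )           # end of the second capture group
""", re.VERBOSE)

link = "http://google.com/whatever/who/jx.html"
prefix, path = URL_PARTS.match(link).groups()
print(prefix, path)
```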

【Comments】:

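It should cover https as well.
If we are only splitting a URL string, we can use ^ and $.
@Braj - You can, but it isn't necessary. In Python, re.match matches at the start of the string by default.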
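A quick sketch of the point in the last comment: re.match only tries the pattern at the start of the string, while re.search scans forward for the first position where it matches. The sample string text (a URL with a leading word) is an assumption added for the demonstration.

```python
import re

pattern = r"(https?://.+?)(/.*$)"
text = "see http://google.com/whatever/who/jx.html"

# re.match is implicitly anchored at the start, so the leading "see " makes it fail.
match_result = re.match(pattern, text)

# re.search is not anchored, so it finds the URL later in the string.
search_groups = re.search(pattern, text).groups()

print(match_result)
print(search_groups)
```

So with re.match an explicit ^ adds nothing, while the $ in the pattern still does the work of pinning the match to the end.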