Use a dynamic int variable inside a regex pattern in Python

【Question title】: use dynamic int variable inside regex pattern python
【Posted】: 2020-04-03 06:32:37
【Question description】:

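I have just started learning Python, so apologies if this question has already been asked.

I'm writing here because the existing questions did not help me. My requirement is to read a file and print all the URLs in it. Inside a for loop, the regex pattern I used is [^https://][\w\W]*, and it works fine. But I would like to know whether I can dynamically pass the length of the line after https:// and use that number of occurrences in the pattern instead of *.

I tried [^https://][\w\W]{var}} with var=len(line)-len(https://)

These are some of the other patterns I tried: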

pattern = '[^https://][\w\W]{'+str(int(var))+'}'

pattern = r'[^https://][\w\W]{{}}'.format(var)

pattern = r'[^https://][\w\W]{%s}'%var

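【Question comments】:

Your pattern is odd as it stands. For example, do you realize that [^https://] does not match the string https:// at the start of a line? Instead, it matches any single character that is not one of h, t, p, s, : or /.
Yes, I realize that about my https pattern now. Thanks for spotting it.
@JACK, did any of the answers provided help you? If so, please remember to mark the answer so that others can benefit from it too.

Tags: python regex variables int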

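【Solution 1】:

I may have misunderstood your question, but if you know the URL always starts with https://, then that prefix is the first eight characters. Once you have found the URLs, you can take the part after the prefix and get its length: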

# Example of list containing urls - you should fill that with your for loop
list_urls = ['https://stackoverflow.com/questions/61006253/use-dynamic-int-variable-inside-regex-pattern-python', 'https://google.com', 'https://stackoverflow.com']
for url in list_urls:
    print(url[8:])

Out:

stackoverflow.com/questions/61006253/use-dynamic-int-variable-inside-regex-pattern-python
google.com
stackoverflow.com
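
The slice above hard-codes 8 as the length of https://. A minimal sketch, assuming the same kind of list, that derives the offset from the prefix and also records the length of what follows it. The names prefix and remainder_lengths are illustrative and not part of the answer:

```python
# Hypothetical sample data; in practice this comes from your file.
list_urls = ['https://google.com', 'https://stackoverflow.com']

prefix = 'https://'  # assumption: every URL starts with this scheme

remainder_lengths = {}
for url in list_urls:
    if url.startswith(prefix):
        rest = url[len(prefix):]            # same as url[8:] for this prefix
        remainder_lengths[rest] = len(rest)  # length of the part after https://

print(remainder_lengths)
# {'google.com': 10, 'stackoverflow.com': 17}
```

The length stored here is the same value the question computes as len(line)-len('https://').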

Instead of a for loop, you can use re.findall to find all the URLs:

import re

# Raw string so that backslashes reach the regex engine unchanged
url_pattern = r"((https:\/\/)([\w-]+\.)+[\w-]+[.+]+([\w%\/~\+#]*))"
# text refers to your document, which should be read before this
urls = re.findall(url_pattern, text)

# Using list comprehensions
# Get the unique urls by using set
# Only get text after https:// using [8:]
# Only parse the first element of the group that is returned by re.findall using [0]
unique_urls = list(set([x[0][8:] for x in urls]))

# print the urls
print(unique_urls)
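
The snippet above depends on a text variable that is not shown. As far as I can tell, its pattern also needs at least two dots in the host, so a URL such as https://google.com would not be matched. Here is a self-contained sketch of the same findall-then-deduplicate idea, using a deliberately simplified pattern (https:// followed by non-whitespace). Both the sample text and the simplified pattern are my assumptions, not the answerer's:

```python
import re

# Hypothetical stand-in for the contents of your file.
text = """See https://www.example.com/page and https://google.com
Again: https://google.com"""

# Simplified assumption: a URL is "https://" followed by non-whitespace.
url_pattern = r"https://(\S+)"

# With a single capture group, re.findall returns only the captured text,
# i.e. the part after "https://".
found = re.findall(url_pattern, text)

# dict.fromkeys drops duplicates while keeping first-seen order.
unique_urls = list(dict.fromkeys(found))

print(unique_urls)
# ['www.example.com/page', 'google.com']
```

Using dict.fromkeys instead of set keeps the URLs in the order they appear in the file, which may or may not matter for your case.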

【Comments】:

【Solution 2】:

In your pattern you use [^https://], which is a negated character class ([^...]) that matches any single character except the ones listed.
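
To make the point about the character class concrete, here is a small sketch (the sample strings are only for illustration) showing what [^https://] actually matches:

```python
import re

line = "https://www.test.com"

# [^https://] is a character class: ONE character that is not
# 'h', 't', 'p', 's', ':' or '/'. It does not mean "not the string https://".
non_prefix_chars = re.findall(r"[^https://]", "https://www")
print(non_prefix_chars)
# ['w', 'w', 'w']

# The first character of the line that satisfies the class is the 'w' at index 8.
first = re.search(r"[^https://]", line)
print(first.start(), first.group())
# 8 w

# To require the literal prefix at the start of the line, spell it out.
starts_with_https = bool(re.match(r"https://", line))
print(starts_with_https)
# True
```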

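One option is to use literal string interpolation (an f-string). Assuming your links do not contain spaces, you can use \S instead of [\w\W], since the latter matches any character, including spaces and newlines.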

\bhttps://\S{{{var}}}(?!\S)


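The assertion (?!\S) at the end is a whitespace boundary that prevents partial matches, and the word boundary \b prevents https from being part of a larger word.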
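
The triple braces can look confusing, so here is a short sketch of how the quantifier text gets built. The value 12 is only an example, and the variable names are mine:

```python
var = 12  # example value; in the question it is len(line) - len('https://')

# In an f-string, '{{' and '}}' produce literal braces and '{var}' is
# replaced by the value, so '{{{var}}}' becomes '{12}'.
with_fstring = rf"\bhttps://\S{{{var}}}(?!\S)"

# Other ways of building the same pattern text.
with_percent = r"\bhttps://\S{%s}(?!\S)" % var
with_concat = r"\bhttps://\S{" + str(var) + r"}(?!\S)"

# '{{}}' on its own is just an escaped pair of braces, so str.format never
# inserts the value (this mirrors one of the attempts in the question).
broken = r"\S{{}}".format(var)

print(with_fstring)
# \bhttps://\S{12}(?!\S)
print(with_fstring == with_percent == with_concat)
# True
print(broken)
# \S{}
```

So the %-formatting and concatenation attempts from the question do insert the number. The str.format attempt does not, because it has no replacement field left after the braces are escaped.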


For example:

import re
line = "https://www.test.com"
lines = "https://www.test.com https://thisisatestt https://www.dontmatchme"

var=len(line)-len('https://')
pattern = rf"\bhttps://\S{{{var}}}(?!\S)"

print(re.findall(pattern, lines))

Output:

['https://www.test.com', 'https://thisisatestt']
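
To see what the trailing (?!\S) contributes, here is a sketch that runs the same pattern with and without it on the sample string from the example. The variable names with_boundary and without_boundary are mine:

```python
import re

lines = "https://www.test.com https://thisisatestt https://www.dontmatchme"
var = 12  # number of non-whitespace characters required after https://

with_boundary = re.findall(rf"\bhttps://\S{{{var}}}(?!\S)", lines)
without_boundary = re.findall(rf"\bhttps://\S{{{var}}}", lines)

print(with_boundary)
# ['https://www.test.com', 'https://thisisatestt']
print(without_boundary)
# ['https://www.test.com', 'https://thisisatestt', 'https://www.dontmatc']
print(len(with_boundary))
# 2
```

Without the lookahead, the longer URL is cut off after 12 characters and reported as a match. The last line gives the number of occurrences, if a count is what you are after.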

【Comments】: