Python re.findall returns links with unwanted trailing text

【Question title】: Python re.findall returning links with unwanted string afterwards
【Posted】: 2017-10-11 20:00:23
【Question description】:

I have a Python script that scrapes with BeautifulSoup. This is my code:

re.findall('stream:\/\/.+', link)

It is meant to find links like this one:

stream://987cds9c8ujru56236te2ys28u99u2s

But it also returns strings like this:

stream://987cds9c8ujru56236te2ys28u99u2s  [SD] Spanish - (9.15am)

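That is, the result includes whitespace and extra text that I don't want. How should I write the re.findall pattern so that it returns only the first part, the link itself?

(Thanks in advance)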

【Comments】:

$ is the regular expression anchor for the end of the string.

Tags: python beautifulsoup findall scraper

【Solution 1】:

You can use a non-greedy match (add ? after the quantifier) together with the word boundary '\b':

>>> re.findall(r'stream:\/\/.+?\b', link)
['stream://987cds9c8ujru56236te2ys28u99u2s']

Or, if you only want to match word characters, you can simply use '\w+':

>>> re.findall(r'stream:\/\/\w+', link)
['stream://987cds9c8ujru56236te2ys28u99u2s']
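The two patterns above can be compared with the original greedy one in a short script. This is only a sketch: it assumes `link` holds exactly the sample line from the question, and it writes the slashes unescaped, since `/` has no special meaning in Python regular expressions.

```python
import re

# Sample line taken from the question (assumed to be the full value of `link`).
link = "stream://987cds9c8ujru56236te2ys28u99u2s  [SD] Spanish - (9.15am)"

# Greedy: '.+' runs to the end of the line, so the trailing text is kept.
greedy = re.findall(r"stream://.+", link)

# Non-greedy plus word boundary: stops at the first boundary after the ID.
lazy = re.findall(r"stream://.+?\b", link)

# Word characters only: letters, digits and underscore.
word_only = re.findall(r"stream://\w+", link)

print(greedy)     # ['stream://987cds9c8ujru56236te2ys28u99u2s  [SD] Spanish - (9.15am)']
print(lazy)       # ['stream://987cds9c8ujru56236te2ys28u99u2s']
print(word_only)  # ['stream://987cds9c8ujru56236te2ys28u99u2s']
```

With this input, both alternatives return the bare link, while the greedy pattern returns the whole line.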

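【Comments】:

Thanks for your help, but unfortunately the string in the link isn't a word - it can be made up of letters and digits, so a word boundary or word characters won't work. I'll try the $ suggestion above as an end-of-string boundary. The non-greedy suggestion does help, though.
@manners I said "if"; I never said it works in every case. Also, do you mean the first approach doesn't work?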
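For reference, in Python's `re` module `\w` matches letters, digits and the underscore, so an ID made of letters and digits is still covered. The sketch below checks that, and also shows what happens if the ID contains other characters. The hyphenated link is a made-up example, not from the question, and `\S+` is an alternative not proposed in the thread; it assumes the link is always followed by whitespace.

```python
import re

# '\w' covers letters, digits and underscore, so the mixed ID matches in full.
is_word = re.fullmatch(r"\w+", "987cds9c8ujru56236te2ys28u99u2s") is not None

# Hypothetical link with a hyphen, which '\w' does not match.
text = "stream://abc-123  [SD] Spanish - (9.15am)"

# Stops at the hyphen because '-' is not a word character.
word_chars = re.findall(r"stream://\w+", text)

# Stops at the first whitespace character instead.
non_space = re.findall(r"stream://\S+", text)

# '$' only anchors the end of the string; the greedy '.+' still reaches it.
with_dollar = re.findall(r"stream://.+$", text)

print(is_word)      # True
print(word_chars)   # ['stream://abc']
print(non_space)    # ['stream://abc-123']
print(with_dollar)  # ['stream://abc-123  [SD] Spanish - (9.15am)']
```

Adding `$` to the original pattern therefore does not trim the trailing text; the match has to be limited by what the quantifier is allowed to consume.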