Only output matching regex pattern

【Question title】: Only output matching regex pattern
【Posted】: 2018-08-04 20:10:34
【Question description】:

I have a csv file with 10,000 rows. Each row has 8 columns. One of the columns contains text similar to this:

this is a row:   http://somedomain.com | some_text | http://someanotherdomain.com | some_more_text
this is a row:   http://yetanotherdomain.net
this is a row:   https://hereisadomain.org | some_text

I am currently accessing the data in this column like this:

for row in csv_reader:
    the_url = row[3]

    # this regex is used to find the hrefs
    href_regex = re.findall('(?:http|ftp)s?://.*', the_url)
    for link in href_regex:
        print(link)

Output of the print statement:

http://somedomain.com | some_text | http://someanotherdomain.com | some_more_text
http://yetanotherdomain.net
https://hereisadomain.org | some_text

How do I get only the URLs?

http://somedomain.com
http://someanotherdomain.com 
http://yetanotherdomain.net
https://hereisadomain.org

【Question comments】:

Tags: regex python-3.x csv

【Solution 1】:

Just change your pattern to:

\b(?:http|ftp)s?://\S+

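Instead of matching anything with .*, match a run of non-whitespace characters with \S+. Because . matches any character except a newline, .* runs to the end of the line, while \S+ stops at the first whitespace after the URL. You may also want to add a word boundary before the non-capturing group.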
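As a minimal sketch of this pattern in use, the snippet below applies it to the three sample values from the question. The `rows` list and the `extract_urls` helper are illustrative names, not part of the original code; in the asker's script the text would come from `row[3]` of the csv reader instead.

```python
import re

# Pattern from this answer: scheme, "://", then a run of non-whitespace characters.
URL_PATTERN = re.compile(r'\b(?:http|ftp)s?://\S+')

# Sample column values copied from the question (stand-in for row[3]).
rows = [
    "this is a row:   http://somedomain.com | some_text | http://someanotherdomain.com | some_more_text",
    "this is a row:   http://yetanotherdomain.net",
    "this is a row:   https://hereisadomain.org | some_text",
]


def extract_urls(text):
    """Return every substring of text that matches URL_PATTERN (hypothetical helper)."""
    return URL_PATTERN.findall(text)


# Flatten the per-row matches into one list, preserving order.
urls = [url for row in rows for url in extract_urls(row)]
for url in urls:
    print(url)
```

Because the group is non-capturing, `findall` returns the full match for each URL rather than just the scheme.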

See it live here.

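【Comments】:

You're welcome. Consider accepting my answer, since it solved your problem. See here for how to do that.
I tried to accept the answer the right way, but I was prompted to wait 10 minutes.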

【Solution 2】:

Instead of repeating any character at the end

'(?:http|ftp)s?://.*'
                  ^

repeat any character except a space, so that the pattern stops matching at the end of the URL:

'(?:http|ftp)s?://[^ ]*'
                  ^^^^

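【Comments】:

Actually this is not correct: you are only negating the space character, not newlines. See here. Adding a word boundary would also be good.
@UnbearableLightness The OP's code has text before every URL. If that were a problem, he would see his http://yetanotherdomain.net run together with his https://hereisadomain.org. I'm skeptical about the word boundary, because a URL can contain trailing non-word characters that are still an important part of the URL.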
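The two points raised in these comments can be checked with a short sketch. The input strings below are made up for illustration (the newline and tab separators and the `http://example.com/path/` URL do not come from the question), and the trailing `\b` is only one possible reading of the word-boundary suggestion.

```python
import re

# Hypothetical input: two URLs separated by a newline, then a tab, with no spaces at all.
text = "http://yetanotherdomain.net\nhttps://hereisadomain.org\tsome_text"

# [^ ] excludes only the space character, so newlines and tabs are still consumed.
space_only = re.findall(r'(?:http|ftp)s?://[^ ]*', text)

# \S excludes every whitespace character, so each URL ends at the newline or tab.
any_whitespace = re.findall(r'(?:http|ftp)s?://\S+', text)

# A word boundary at the END of the pattern forces the match to stop after a
# word character, which drops a trailing slash from this hypothetical URL.
trailing_boundary = re.findall(r'(?:http|ftp)s?://\S+\b', "http://example.com/path/")

print(space_only)
print(any_whitespace)
print(trailing_boundary)
```

When each csv cell holds a single line separated by spaces, as in the question, both character classes give the same result; the difference only shows up when other whitespace separates the URLs.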