【Question Title】: How do I convert URLs in plain text to clickable links using Python?
【Posted】: 2016-01-26 16:27:58
【Description】:

I have some plain text. For example, consider the following sentence:

I was browsing www.google.com when I found an interesting site, www.stackoverflow.com. It's amazing!

In the example above, www.google.com is plain text, and I need to convert it to www.google.com wrapped in an anchor tag that links to google.com. www.stackoverflow.com, on the other hand, is already inside an anchor tag, and I want to leave it unchanged. How can I do this with a Python regular expression?

【Discussion】:

  • Using an HTML parser would be a much better solution. Regex is not the right tool for a job like this!
  • @Docteur Could you give a simple how-to example of replacing text with an HTML parser? Thank you so much! :)

Tags: python html regex url


【Solution 1】:

This task has to be split into two parts:

  • extract all the text that is not inside an a tag
  • find (or rather guess) all the URLs in that text and wrap them

For the first part I recommend BeautifulSoup. You could also use html.parser, but that would mean a lot of extra work.
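To give a sense of that extra work, here is a minimal stdlib-only sketch (class and attribute names are my own) that collects the text nodes outside a tags with html.parser. Note that it only extracts the text; rebuilding the document around the replacements is where the real effort would go, which is why BeautifulSoup is the easier route:

```python
from html.parser import HTMLParser

class TextOutsideAnchors(HTMLParser):
    """Collect the text nodes that are not inside an <a> tag."""
    def __init__(self):
        super().__init__()
        self.in_anchor = 0   # nesting depth of open <a> tags
        self.chunks = []     # text found outside anchors

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_anchor += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.in_anchor:
            self.in_anchor -= 1

    def handle_data(self, data):
        if not self.in_anchor:
            self.chunks.append(data)

p = TextOutsideAnchors()
p.feed('I was surfing <a href="...">www.google.com</a>, and found www.example.com')
print(p.chunks)  # → ['I was surfing ', ', and found www.example.com']
```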

Use a recursive function to find the text nodes:

from bs4 import BeautifulSoup
from bs4.element import NavigableString

your_text = """I was surfing <a href="...">www.google.com</a>, and I found an
interesting site https://www.stackoverflow.com/. It's amazing! I also liked
Heroku (http://heroku.com/pricing)
more.domains.tld/at-the-end-of-line
https://at-the_end_of-text.com"""

soup = BeautifulSoup(your_text, "html.parser")

def wrap_plaintext_links(bs_tag):
    for element in bs_tag.children:
        if isinstance(element, NavigableString):
            pass # now we have a text node, process it
        # so it is a Tag (or the soup object, which is for most purposes a tag as well)
        elif element.name != "a": # if it isn't the a tag, process it recursively
            wrap_plaintext_links(element)

wrap_plaintext_links(soup) # call the recursive function

You can verify that it finds only the values you want by replacing pass with print(element).


Now for finding the URLs and building the replacements. How complex the regex should be really depends on how precise you want it to be. I would go with this:

(https?://)?        # match http(s):// in separate group if present
(                   # start of the main capturing group, what will be between the tags
  (?:[\w-]+\.)+     #   at least one domain and any subdomains before TLD
  [a-z]+            #   TLD
  (?:/\S*?)?        #   /[anything except whitespace] if present - URL path
)                   # end of the group
(?=[\.,)]?(?:\s|$)) # allow one trailing ".", "," or ")" after the URL, keeping it out of the match
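To see what this pattern does and doesn't match, here is a quick check, compiled with re.X so the commented explanation above can stay inline (the sample sentences are my own):

```python
import re

# The pattern from above, kept verbose with re.X.
URL_RE = re.compile(r"""
    (https?://)?        # match http(s):// in a separate group if present
    (                   # main group: what goes between the tags
      (?:[\w-]+\.)+     #   at least one domain label before the TLD
      [a-z]+            #   TLD
      (?:/\S*?)?        #   optional URL path
    )
    (?=[\.,)]?(?:\s|$)) # allow one trailing ".", "," or ")" after the URL
""", re.X)

for text in ["surfing www.google.com today",
             "see https://www.stackoverflow.com/.",
             "liked (http://heroku.com/pricing) a lot"]:
    m = URL_RE.search(text)
    print(m.group(1), m.group(2))
```

Plain words like "surfing" are skipped because the pattern requires at least one dot, and the trailing ".", ")" punctuation stays outside group 2.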

The functions and code to add, including the replacement:

import re

def create_replacement(matchobj):
    if matchobj.group(1): # if there's http(s)://, keep it
        full_url = matchobj.group(0)
    else: # otherwise prepend a scheme; whether to default to http or https is your call
        full_url = "http://" + matchobj.group(2)
    tag = soup.new_tag("a", href=full_url)
    tag.string = matchobj.group(2)
    return str(tag)

# compile the pattern beforehand, as it's going to be used many times
r = re.compile(r"(https?://)?((?:[\w-]+\.)+[a-z]+(?:/\S*?)?)(?=[\.,)]?(?:\s|$))")

def wrap_plaintext_links(bs_tag):
    for element in bs_tag.children:
        if isinstance(element, NavigableString):
            replaced = r.sub(create_replacement, str(element))
            element.replace_with(BeautifulSoup(replaced, "html.parser")) # make it a Soup so that the tags aren't escaped
        elif element.name != "a":
            wrap_plaintext_links(element)

Note: you can also keep the pattern explanation from above inside the compiled pattern — see the re.X flag.
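As a stdlib-only sketch of just the substitution step, plain string formatting can stand in for soup.new_tag. Unlike new_tag it does no HTML escaping of the attribute value, so prefer the bs4 version above for real documents:

```python
import re

URL_RE = re.compile(r"(https?://)?((?:[\w-]+\.)+[a-z]+(?:/\S*?)?)(?=[\.,)]?(?:\s|$))")

def create_replacement(m):
    # Keep an existing http(s):// scheme; otherwise prepend http://
    # (defaulting to http over https is an arbitrary choice, as noted above).
    full_url = m.group(0) if m.group(1) else "http://" + m.group(2)
    return '<a href="%s">%s</a>' % (full_url, m.group(2))

print(URL_RE.sub(create_replacement, "I found www.google.com today"))
# → I found <a href="http://www.google.com">www.google.com</a> today
```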

【Discussion】:
