Python String Replace: Keywords into URLs

【Question title】: Python String Replace: Keywords into URLs
【Posted】: 2012-06-09 10:55:38
【Question】:

I want to replace certain keywords in a string with links, for example:

content.replace("Google", '<a href="http://www.google.com">Google</a>')

However, I only want to replace a keyword with a link if it is not already wrapped in one.

The content is simple HTML:

<p><b>This is an example!</b></p><p>I love <a href="http://www.google.com">Google</a></p><p><a href="http://www.google.com"><img src="/google.jpg" /></a></p>

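It consists mostly of <a> and <img> tags.

The main question: how can I tell whether a keyword is already wrapped in an <a> or <img> tag?

There is a similar question for PHP, "find and replace keywords with urls ONLY if not already wrapped in a url", but the answer given there is not an effective one.

Is there a better solution in Python? A code example would be ideal. Thanks!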

【Question comments】:

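Could you give an example of the kind of text you want to run this on?
@Acorn An HTML web page. Example: This is an example!I love <a href="http://www.google.com">Google</a><a href="http://www.google.com"><img src="/google.jpg" /></a>
You could build a regular expression that matches <a> or <img> tags, using the example I show below.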

Tags: python string google-app-engine utf-8

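【Solution 1】:

I use Beautiful Soup to parse my HTML, because parsing HTML with regular expressions can... prove tricky. If you use Beautiful Soup, you can play around with previous_sibling and previous_element to work out what you need.

You interact with it like this: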

soup.find_all('img')
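To make the "is this keyword already inside a link?" check concrete, here is a minimal sketch. It assumes the bs4 package (Beautiful Soup 4) with Python's built-in html.parser backend. The helper name is_inside_link and the third sample paragraph are made up for illustration. Instead of previous_sibling, it asks each text node whether it has an <a> ancestor via find_parent.

```python
from bs4 import BeautifulSoup

# Sample markup: the last paragraph is an assumption added for illustration.
html = ('<p><b>This is an example!</b></p>'
        '<p>I love <a href="http://www.google.com">Google</a></p>'
        '<p>Google is a search engine.</p>')

soup = BeautifulSoup(html, "html.parser")


def is_inside_link(text_node):
    """Return True if the text node has an <a> ancestor (hypothetical helper)."""
    return text_node.find_parent("a") is not None


# Walk every text node and record whether the keyword there is already linked.
hits = [(str(node), is_inside_link(node))
        for node in soup.find_all(string=True)
        if "Google" in node]

for text, linked in hits:
    print(repr(text), "already linked" if linked else "safe to replace")
```

Attribute values such as the src of an <img> are not text nodes, so a keyword that only appears there never shows up in this walk and cannot be replaced by accident.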

【Comments】:

【Solution 2】:

As Chris-Top said, BeautifulSoup is the way to go:

import re
from bs4 import BeautifulSoup

html = """
<div>
    <p>The quick brown <a href='http://en.wikipedia.org/wiki/Dog'>fox</a> jumped over the lazy Dog</p>
    <p>The <a href='http://en.wikipedia.org/wiki/Dog'>dog</a>, who was, in reality, not so lazy, gave chase to the fox.</p>
    <p>See image for reference:</p>
    <img src='dog_chasing_fox.jpg' title='Dog chasing fox'/>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# (search term, URL) pairs
keywords = [("dog", "http://en.wikipedia.org/wiki/Dog"),
            ("fox", "http://en.wikipedia.org/wiki/Fox")]

def insertLinks(string_value, string_href):
    pattern = re.compile(string_value, re.IGNORECASE)
    for t in soup.find_all(string=pattern):
        # Skip text whose direct parent is already a link
        if t.parent.name != 'a':
            a = soup.new_tag('a', href=string_href)
            a.string = string_value
            # Split at the first match only: the text before and after the keyword
            string_list = pattern.split(t, maxsplit=1)
            replacement_text = soup.new_string(string_list[0])
            t.replace_with(replacement_text)
            replacement_text.insert_after(a)
            a.insert_after(soup.new_string(string_list[1]))


for word in keywords:
    insertLinks(word[0], word[1])

print(soup)

This produces:

<div>
    <p>The quick brown <a href="http://en.wikipedia.org/wiki/Dog">fox</a> jumped over the lazy <a href="http://en.wikipedia.org/wiki/Dog">dog</a></p>
    <p>The <a href="http://en.wikipedia.org/wiki/Dog">dog</a>, who was, in reality, not so lazy, gave chase to the <a href="http://en.wikipedia.org/wiki/Fox">fox</a>.</p>
    <p>See image for reference:</p>
    <img src="dog_chasing_fox.jpg" title="Dog chasing fox"/>
</div>
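The same idea can be pushed a little further. The sketch below is not the answerer's code. It assumes bs4 with the html.parser backend, and the function name link_keywords is hypothetical. It links every occurrence inside a text node rather than a single one, keeps the original capitalisation of the matched word, and checks all ancestors for an <a> rather than only the direct parent.

```python
import re
from bs4 import BeautifulSoup


def link_keywords(html, keywords):
    """Wrap each keyword occurrence in a link unless it already sits inside an <a>.

    Hypothetical helper: `keywords` is a list of (word, url) pairs.
    """
    soup = BeautifulSoup(html, "html.parser")
    for word, href in keywords:
        # The capturing group makes split() keep the matched text in its result.
        pattern = re.compile("(" + re.escape(word) + ")", re.IGNORECASE)
        for node in soup.find_all(string=pattern):
            if node.find_parent("a") is not None:
                continue  # already linked, leave it untouched
            for i, piece in enumerate(pattern.split(node)):
                if i % 2 == 1:  # odd positions hold the matched keyword text
                    a = soup.new_tag("a", href=href)
                    a.string = piece  # keep the capitalisation found in the page
                    node.insert_before(a)
                elif piece:
                    node.insert_before(soup.new_string(piece))
            node.extract()  # drop the original, now duplicated, text node
    return str(soup)


sample = '<p>Dog and dog, see <a href="x">dog</a></p>'
result = link_keywords(sample, [("dog", "http://example.com/dog")])
print(result)
```

The pattern matches substrings, so "dog" would also match inside "dogma". Adding \b word boundaries around the escaped keyword is one way to tighten that if it matters for your content.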

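【Comments】:

Wow, I had been trying to solve this with the HTMLParser library... I spent about three hours on it... and there was already a library made for it :(
@Kevin P Thanks for taking the time to post some working code :)

【Solution 3】:

You could try a regular expression, as mentioned in the previous post. First test your string against the regex to see whether it is already wrapped in a URL. This should be easy, since a simple call to the re library's search() method does the trick.

If you need a refresher on regular expressions and the search method in particular, here is a good tutorial: http://www.tutorialspoint.com/python/python_reg_expressions.htm

Once you have checked whether the string is already wrapped in a URL, you can call the replace function if it is not.

Here is a simple example I wrote: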

    import re

    x = '<a href="http://www.google.com">Google</a>'
    y = 'Google'

    def checkURL(string):
        # Treat any string containing an opening <a href...> tag as already linked
        if re.search(r'<a href.+', string):
            print("URL Wrapped Already")
            print(string)
        else:
            string = string.replace('Google', '<a href="http://www.google.com">Google</a>')
            print("URL Not Wrapped:")
            print(string)

    checkURL(x)
    checkURL(y)

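I hope this answers your question!

【Comments】:

Huh? I don't quite follow. I'm not searching for a specific string. I only want to replace a keyword with a link when it is not already wrapped in one.
Can you give an example?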
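The check above looks at the whole string at once, so for a full page it helps to apply it piece by piece. The following is only a rough sketch of that idea using the re module alone. The names TAG_RE and link_keyword are hypothetical. It splits the markup into tags and the text between them, tracks whether an <a> element is currently open, and replaces the keyword only in text that lies outside any link.

```python
import re

# Splits markup into tags and the text between them; the capturing group
# makes split() keep the tags themselves in the result.
TAG_RE = re.compile(r"(<[^>]+>)")


def link_keyword(html, keyword, href):
    """Link `keyword` wherever it is not already inside an <a> element."""
    replacement = '<a href="%s">%s</a>' % (href, keyword)
    depth = 0  # number of <a> elements currently open
    out = []
    for i, token in enumerate(TAG_RE.split(html)):
        if i % 2 == 1:  # odd positions are tags
            if re.match(r"<a[\s>]", token, re.IGNORECASE):
                depth += 1
            elif re.match(r"</a\s*>", token, re.IGNORECASE):
                depth = max(depth - 1, 0)
            out.append(token)  # tags and their attributes pass through unchanged
        elif depth == 0:
            out.append(token.replace(keyword, replacement))
        else:
            out.append(token)  # text inside a link stays as it is
    return "".join(out)


content = ('<p>I love <a href="http://www.google.com">Google</a></p>'
           '<p>Try Google</p>')
linked = link_keyword(content, "Google", "http://www.google.com")
print(linked)
```

This is fine for simple, well-formed markup like the sample in the question. It does not handle comments, script blocks, or attribute values containing ">", which is why an HTML parser is usually the more robust choice.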