【Question title】:Add Hyperlinks to HTML using BeautifulSoup in Python using Anchor Text and URL stored in a CSV File
【Posted】:2022-12-08 05:24:28
【Question】:

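I want to write a program in Python with Beautiful Soup that hyperlinks words in an HTML document, using a CSV file that holds the anchor text and the hyperlink for each word.

The CSV file has 2 columns: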

anchor_text,hyperlink
Google,https://www.google.com
Bing,https://bing.com
Yahoo,https://yahoo.com
Active Campaign,https://activecampaign.com
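
For reference, a file like this can be read into a dict keyed by anchor text. This is only a minimal sketch: it assumes the file is comma-separated with the header row `anchor_text,hyperlink`, and the helper name `load_hyperlinks` is hypothetical. An in-memory `io.StringIO` stands in for the real file so the example is self-contained.

```python
import csv
import io


def load_hyperlinks(csv_file):
    """Return {anchor_text: hyperlink} from an open CSV file object.

    Assumes a header row with the columns 'anchor_text' and 'hyperlink'.
    Rows with a blank anchor text or a blank URL are skipped.
    """
    hyperlinks = {}
    for row in csv.DictReader(csv_file):
        text = (row.get('anchor_text') or '').strip()
        url = (row.get('hyperlink') or '').strip()
        if text and url:
            hyperlinks[text] = url
    return hyperlinks


# Stand-in for open('file.csv', newline='') so the sketch runs on its own.
sample = io.StringIO(
    "anchor_text,hyperlink\n"
    "Google,https://www.google.com\n"
    "Bing,https://bing.com\n"
    "Yahoo,https://yahoo.com\n"
    "Active Campaign,https://activecampaign.com\n"
)
hyperlinks = load_hyperlinks(sample)
print(hyperlinks)
```

Using `csv.DictReader` keeps the header row out of the resulting dict, which a plain `dict(csv.reader(...))` would include as an ordinary entry.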

Here is the sample HTML:

<!-- wp:paragraph -->
<p>This is a existing link <a class="test" href="https://yahoo.com/">Yahoo</a> Text</p>
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
<p>This is another Google Text</p>
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
<p>This is another lowercase bing Text</p>
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
<p>This is another multi word Active Campaign Text</p>
<!-- /wp:paragraph -->

I want the output to be:

<!-- wp:paragraph -->
<p>This is a existing link <a href="https://yahoo.com/">Yahoo</a> Text</p>
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
<p>This is another <a href="https://www.google.com/">Google</a> Text</p>
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
<p>This is another lowercase <a href="https://bing.com/">bing</a> Text</p>
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
<p>This is another multi word <a href="https://activecampaign.com/">Active Campaign</a> Text</p>
<!-- /wp:paragraph -->

This is my current code, which doesn't work. It strips out the whole sentence and replaces it with the hyperlink.

import csv
from bs4 import BeautifulSoup

html_doc = """
<!-- wp:paragraph -->
<p>This is a existing link <a class="test" href="https://yahoo.com/">Yahoo</a> Text</p>
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
<p>This is another Google Text</p>
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
<p>This is another lowercase bing Text</p>
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
<p>This is another multi word Active Campaign Text</p>
<!-- /wp:paragraph -->
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# read the CSV file with anchor text and hyperlinks
with open('file.csv', 'r') as csv_file:
  reader = csv.reader(csv_file)
  hyperlinks = dict(reader)

# find all the text nodes in the HTML document
text_nodes = soup.find_all(text=True)

# iterate over the text nodes and replace the anchor text with hyperlinked text
for node in text_nodes:
  for anchor_text, hyperlink in hyperlinks.items():
    if anchor_text in node:
      # create a new tag with the hyperlink
      new_tag = soup.new_tag('a', href=hyperlink)
      new_tag.string = anchor_text
      # replace the original text node with the new one
      node.replace_with(new_tag)

# save the modified HTML to a new file
with open('index_hyperlinked.html', 'w') as outfile:
  outfile.write(str(soup))

print(soup)
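
The symptom described above can be reproduced in a few lines. The sketch below uses one paragraph from the sample HTML and a hard-coded URL instead of the CSV file. It suggests the cause: `replace_with` is called on the whole text node, so the entire sentence is swapped for the new `<a>` tag rather than just the matching word.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>This is another Google Text</p>', 'html.parser')

# The paragraph holds a single text node containing the full sentence.
node = soup.p.string

new_tag = soup.new_tag('a', href='https://www.google.com')
new_tag.string = 'Google'

# replace_with swaps the entire node, so the surrounding words are lost.
node.replace_with(new_tag)

result = str(soup)
print(result)  # <p><a href="https://www.google.com">Google</a></p>
```

Note also that `anchor_text in node` is case-sensitive, so the lowercase "bing" in the sample would not match the CSV entry "Bing".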

【Question comments】:

    Tags: python html beautifulsoup


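    【Solution 1】:

    I didn't specify any parser and just used soup = BeautifulSoup(html_doc) directly; it shouldn't make a difference, but I thought I should mention it just in case.

    You should try looping over the anchors/links in the outer loop, and then break up the matching strings in the inner loop: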

    from bs4 import element as bs4_element
    be_navStr = bs4_element.NavigableString
    
    hList = [
        (anchor_text.strip(), hyperlink.strip()) 
        for anchor_text, hyperlink in hyperlinks.items()
        if anchor_text.strip() and hyperlink.strip() # no blanks
    ]
    
    print('#'*35, 'OLD', '#'*35, '\n')
    print(soup, '\n')
    print('#'*75, '\n\n\n')
    
    for txt, link in hList:
        navStrs = [
            d for d in soup.descendants if type(d) == be_navStr 
            and f' {txt.lower()} ' in f' {d.get_text().strip().lower()} '
        ]
        for ns in navStrs: 
            tLen, remStr = len(txt), f' {ns.get_text().strip()} '
            if remStr[1:-1].lower() == txt.lower():
                # to skip if it's already a hyperlink
                if ns.parent.name == 'a': 
                    # ns.parent['href'] = link # if you want to replace/update link
                    continue 
    
            while f' {txt.lower()} ' in remStr.lower():
                sInd = remStr.lower().find(f' {txt.lower()} ') + 1
    
                hlTag = soup.new_tag('a', href=link)
                hlTag.append(remStr[sInd:sInd + tLen])
    
                newCont = [remStr[:sInd].lstrip(), hlTag, ' ']
                for addn in newCont: ns.insert_before(addn) 
    
                remStr = f' {remStr[sInd + tLen:].strip()} '
            ns.replace_with(remStr.strip())
    
    print('#'*35, 'NEW', '#'*35, '\n')
    print(soup, '\n')
    print('#'*75)
    

    Printed output:

    ################################### OLD ################################### 
    
    <!-- wp:paragraph -->
    <p>This is a existing link <a class="test" href="https://yahoo.com/">Yahoo</a> Text</p>
    <!-- /wp:paragraph -->
    <!-- wp:paragraph -->
    <p>This is another Google Text</p>
    <!-- /wp:paragraph -->
    <!-- wp:paragraph -->
    <p>This is another lowercase bing Text</p>
    <!-- /wp:paragraph -->
    <!-- wp:paragraph -->
    <p>This is another multi word Active Campaign Text</p>
    <!-- /wp:paragraph --> 
    
    ########################################################################### 
    
    
    
    ################################### NEW ################################### 
    
    <!-- wp:paragraph -->
    <p>This is a existing link <a class="test" href="https://yahoo.com/">Yahoo</a> Text</p>
    <!-- /wp:paragraph -->
    <!-- wp:paragraph -->
    <p>This is another <a href="https://www.google.com">Google</a> Text</p>
    <!-- /wp:paragraph -->
    <!-- wp:paragraph -->
    <p>This is another lowercase <a href="https://bing.com">bing</a> Text</p>
    <!-- /wp:paragraph -->
    <!-- wp:paragraph -->
    <p>This is another multi word <a href="https://activecampaign.com">Active Campaign</a> Text</p>
    <!-- /wp:paragraph --> 
    
    ###########################################################################
    

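    This works even when there are multiple matches in the same string, as long as they don't overlap (like "Google Chrome" and "Chrome Beta").

    【Comments】: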
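
If overlapping anchors are a concern, one possible variation is to build a single regular expression from all anchor texts and try the longest ones first. This is a sketch under stated assumptions, not the method used in the answer above: the function name `link_anchors` is hypothetical, matching is case-insensitive and limited to whole words, and text already inside an `<a>` tag is left untouched.

```python
import re

from bs4 import BeautifulSoup
from bs4.element import NavigableString


def link_anchors(html, hyperlinks):
    """Wrap each anchor text found in the text nodes of html in an <a> tag."""
    soup = BeautifulSoup(html, 'html.parser')

    # Lowercased anchor text -> URL, skipping blank entries.
    lookup = {
        text.strip().lower(): url.strip()
        for text, url in hyperlinks.items()
        if text.strip() and url.strip()
    }
    if not lookup:
        return str(soup)

    # Longest anchors first, so "Google Chrome" is tried before "Google".
    alternatives = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r'(?<!\w)(' + '|'.join(re.escape(a) for a in alternatives) + r')(?!\w)',
        re.IGNORECASE,
    )

    # Snapshot the plain text nodes first, because the tree is modified below.
    # The exact type check leaves out comments and other string subclasses.
    nodes = [n for n in soup.find_all(string=True) if type(n) is NavigableString]
    for node in nodes:
        if node.find_parent('a') is not None:
            continue  # already inside a hyperlink
        text = str(node)
        pieces, last = [], 0
        for match in pattern.finditer(text):
            if match.start() > last:
                pieces.append(text[last:match.start()])
            tag = soup.new_tag('a', href=lookup[match.group(1).lower()])
            tag.string = match.group(1)  # keep the casing used in the document
            pieces.append(tag)
            last = match.end()
        if not pieces:
            continue  # nothing matched in this node
        if last < len(text):
            pieces.append(text[last:])
        for piece in pieces:
            node.insert_before(piece)
        node.extract()  # drop the original, now duplicated, text node
    return str(soup)


print(link_anchors(
    '<p>Try Google Chrome or google today</p>',
    {'Google': 'https://www.google.com',
     'Google Chrome': 'https://www.google.com/chrome'},
))
```

Because `finditer` returns non-overlapping matches from left to right, each stretch of text is linked at most once, and the longest-first ordering decides which anchor wins when one anchor text is a prefix of another.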