【Question title】: Combining text output with BeautifulSoup
【Posted】: 2020-02-03 07:17:35
【Question】:

I'm having trouble parsing a link from a file because it is not a complete link. The text to parse is:

<enclosure url="/itunes/463/RKBU-How-the-Seas-Shaped-Humanit-02019_09_24_13_40_18-0.mp3" length="83586948" type="audio/mpeg"/>

The link should be:

https://www.opednews.com/itunes/463/RKBU-How-the-Seas-Shaped-Humanit-02019_09_24_13_40_18-0.mp3

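How can I prepend the first part of the site address to the links produced by the code below, so that each address is complete? Any suggestions would be greatly appreciated.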

def get_playable_podcast1(soup1):
    subjects = []
    for content in soup1.find_all('item', limit=9):
        try:
            link = content.find('enclosure')
            link = link.get('url')
            print("\n\nLink: ", link)
            title = content.find('title')
            title = title.get_text()
        except AttributeError:
            continue
        item = {
                'url': link,
                'title': title,
                'thumbnail': "https://upload.wikimedia.org/wikipedia/en/thumb/2/21/OpEdNews_%28logo%29.jpg/200px-OpEdNews_%28logo%29.jpg",
        }
        subjects.append(item)
    return subjects

【Question comments】:

    Tags: python parsing beautifulsoup


    【Solution 1】:

    You can use BeautifulSoup together with urllib.parse.urljoin:

    import urllib.parse
    from bs4 import BeautifulSoup as soup
    url, html = 'https://www.opednews.com', '<enclosure url="/itunes/463/RKBU-How-the-Seas-Shaped-Humanit-02019_09_24_13_40_18-0.mp3" length="83586948" type="audio/mpeg"/>'
    result = urllib.parse.urljoin(url, soup(html, 'html.parser').enclosure['url'])
    

    Output:

    'https://www.opednews.com/itunes/463/RKBU-How-the-Seas-Shaped-Humanit-02019_09_24_13_40_18-0.mp3'
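To apply the same idea inside the function from the question, something like the following might work. This is only a sketch: the `BASE_URL` constant and the `base_url` parameter are names introduced here for illustration, and the base address is assumed from the expected link given in the question. `urljoin` leaves a URL that is already absolute unchanged, so feeds that mix relative and absolute links should also be handled.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumption: the site root, taken from the expected link in the question.
BASE_URL = "https://www.opednews.com"


def get_playable_podcast1(soup1, base_url=BASE_URL):
    """Collect up to 9 feed items, resolving each enclosure URL against base_url."""
    subjects = []
    for content in soup1.find_all('item', limit=9):
        enclosure = content.find('enclosure')
        title_tag = content.find('title')
        # Skip items that lack an enclosure or a title.
        if enclosure is None or title_tag is None:
            continue
        relative = enclosure.get('url')
        # Skip enclosures without a url attribute; urljoin would otherwise
        # return the bare base URL.
        if not relative:
            continue
        subjects.append({
            # Relative paths are completed; absolute URLs pass through unchanged.
            'url': urljoin(base_url, relative),
            'title': title_tag.get_text(),
            'thumbnail': "https://upload.wikimedia.org/wikipedia/en/thumb/2/21/OpEdNews_%28logo%29.jpg/200px-OpEdNews_%28logo%29.jpg",
        })
    return subjects


if __name__ == "__main__":
    # Hypothetical two-item feed, used only to exercise the function.
    sample = (
        '<item><title>Episode A</title>'
        '<enclosure url="/itunes/463/a.mp3" length="1" type="audio/mpeg"/></item>'
        '<item><title>No audio</title></item>'
    )
    for entry in get_playable_podcast1(BeautifulSoup(sample, 'html.parser')):
        print(entry['title'], entry['url'])
```

Passing the base address as a parameter rather than hard-coding it inside the loop keeps the function reusable for other feeds.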
    

    【Discussion】:
