Convert a string into a Beautiful Soup object

[Question title]: Convert a string into a Beautiful Soup object
[Posted]: 2020-05-05 00:11:11
[Question]:

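I'm very new to Python and to posting here, so any help would be greatly appreciated! I'm trying to use Beautiful Soup to dynamically parse 30+ different RSS blog feeds. Surprisingly, they are not standardized. So I first created a list of all the potential XML tags I want to scrape, which I named headers: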

headers = ['title', 'description', 'author', 'credit', 'pubDate', 'link', 'origLink']

Then I collect all the tags from the RSS feed I'm trying to scrape and put them into their own list, named tags:

import requests
from bs4 import BeautifulSoup as bs
requests.packages.urllib3.disable_warnings()

headers = ['title', 'description', 'author', 'credit', 'pubDate', 'link', 'origLink']

url = 'https://www.zdnet.com/blog/security/rss.xml'
resp = requests.get(url, verify=False)
soup = bs(resp.text, features='xml')
data = soup.find_all('item')

tags = [tag.name for tag in data[0].find_all()]
print(tags)

Then I build a new list of tags, n_tags, containing the elements that appear in both lists:

n_tags = [i for i in headers if i in tags]
print(n_tags)

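Then I loop over all the items in data (all the blog posts on the page) and over all the elements in the new tag list (all the tags relevant to that blog). Where I'm stuck is that n_tags is a list of strings, not soup objects.

The manual way to parse the feed is: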

for item in data:
    print(item.title.text)
    print(item.description.text)
    print(item.pubDate.text)
    print(item.credit.text)
    print(item.link.text)

However, I'd like to iterate over the list of tags and plug each one into the code to get the content of the XML tag.

for item in data:
    for el in n_tags:
        content = item + "." + el + ".text"
        print(content)

This returns an error:

TypeError: unsupported operand type(s) for +: 'Tag' and 'str'

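I need to convert the strings in the list into soup "Tag" objects so that they can be concatenated. I tried converting the Tag object to a string and rebuilding the whole string as a soup object, but it didn't work. It didn't raise an error, it just returned None: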

content = str(item) + "." + el + ".text"
print(soup.content)

The closest I've gotten is:

for item in data:
    for el in n_tags:
        content = str(item) + "." + el + ".text"
        print(content)

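It actually returns content, but not what I'm looking for: the ".text" doesn't seem to be applied, and the blog post content is repeated for every element in the list.

I'm out of ideas. Thanks for reading, and please let me know if you have any questions.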

[Comments]:

Tags: python python-3.x beautifulsoup rss rss-reader

[Solution 1]:

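I'm not sure I understood your question correctly, but it seems you're trying to select text only from specific elements in the RSS feed.

You can try this script (using CSS selectors):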

import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.zdnet.com/blog/security/rss.xml'
soup = bs(requests.get(url).content, 'html.parser')

headers = ['title', 'description', 'author', 'credit', 'pubDate', 'link', 'origLink']

for tag in soup.select(','.join(headers)):
    print(tag.text)

Prints:

ZDNet | security RSS

Tue, 05 May 2020 00:15:23 +0000

ZDNet | security RSS

US financial industry regulator warns of widespread phishing campaign
FINRA warns of phishing campaign aimed at stealing members' Microsoft Office or SharePoint passwords.
Mon, 04 May 2020 23:29:00 +0000

Academics turn PC power units into speakers to leak secrets from air-gapped systems
POWER-SUPPLaY technique uses "singing capacitor" phenomenon for data exfiltration.
Mon, 04 May 2020 16:06:00 +0000

... and so on.
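
If you'd rather keep the per-item loop from the question, one option is to pass each name in the list straight to item.find(), since item.title is just shorthand for item.find('title'). Below is a minimal sketch of that idea, not a tested drop-in for all 30 feeds. SAMPLE_RSS and the extract_items helper are hypothetical names made up for illustration, the feed is inlined so nothing is fetched, and features='xml' assumes lxml is installed (as in the question's own code).

```python
from bs4 import BeautifulSoup

# Hypothetical two-item feed, inlined so the sketch runs without a network call.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First post</title>
      <description>Summary of the first post.</description>
      <pubDate>Mon, 04 May 2020 23:29:00 +0000</pubDate>
      <link>https://example.com/first</link>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.com/second</link>
    </item>
  </channel>
</rss>"""

headers = ['title', 'description', 'author', 'credit', 'pubDate', 'link', 'origLink']


def extract_items(xml_text, wanted):
    """Return one dict per <item>, keyed by whichever wanted tags are present."""
    # The 'xml' feature requires lxml and keeps tag names case-sensitive (pubDate).
    soup = BeautifulSoup(xml_text, features='xml')
    rows = []
    for item in soup.find_all('item'):
        row = {}
        for name in wanted:
            # find() accepts the tag name as a plain string, so no
            # string-to-Tag conversion is needed.
            tag = item.find(name)
            if tag is not None:  # find() returns None when the tag is absent
                row[name] = tag.get_text(strip=True)
        rows.append(row)
    return rows


rows = extract_items(SAMPLE_RSS, headers)
for row in rows:
    print(row)
```

Because find() returns None for a missing tag, the None check stands in for the n_tags filtering step, and each feed can carry a different subset of headers. This also suggests why print(soup.content) in the question gave None: attribute access looks for a tag literally named content, which the feed doesn't have.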

[Comments]: