Separate elements from a BeautifulSoup ResultSet

【Question title】: Separate elements from BeautifulSoup ResultSet
【Posted】: 2019-06-18 05:31:55
【Question】:

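I'm working on a project using Python (3.7) and BeautifulSoup (4) in which I need to scrape some data without knowing the exact structure of the HTML, assuming only that the information relevant to the user will be in heading, paragraph, pre and code tags. After calling find_all on these tags, I want to separate the heading and paragraph tags from the code and pre tags in the ResultSet object.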

Here is what I have tried:

required_tags = ["h1", "h2", "h3", "h4", "h5", "pre", "code", "p"]
text_outputs = []
code_outputs = []
pages = [
        "https://bugs.launchpad.net/bugs/1803780",
        "https://bugs.launchpad.net/bugs/1780224",
        "https://docs.openstack.org/keystone/pike/_modules/keystone/assignment/core.html",
        "https://openstack-news.blogspot.com/2018/11/bug-1803780-confusing-circular.html",
        "https://www.suse.com/documentation/suse-openstack-cloud-9/doc-cloud-upstream-user/user"
        "/html/keystone/_modules/keystone/assignment/core.html"
    ]


page = requests.get(pages[0])
html_text = BeautifulSoup(page.text, 'html.parser')
text = html_text.find_all(required_tags)
elements = []
for e in html_text:
    elements.append(e.parent)
for t in text:
    for e in elements:
        if e == 'code' or e == 'pre':
            print(e)
            code_outputs.append(t.get_text())
        else:
            text_outputs.append(t.get_text())

But it doesn't put the expected content into code_outputs and text_outputs.

Thanks in advance!

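【Comments】:

Which URL are you trying to parse? Please add it.
It's a list of URLs, let me add it!
Yes, add all the relevant code to reduce the guesswork.
Hi @DeveshKumarSingh, I have added all the URLs.
Can you double-check that the URLs you have actually contain code and pre tags? I don't think they all do.

Tags: python python-3.x web-scraping beautifulsoup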

【Solution 1】:

Just check the parent's name on each matched element

t.parent.name == 'code'

instead of building a list of parent elements.
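
To make the difference between a tag's own name and its parent's name concrete, here is a minimal offline sketch. The markup in SAMPLE_HTML is invented for illustration and is not taken from any of the pages in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
SAMPLE_HTML = """
<h1>Title</h1>
<p>Call <code>foo()</code> first.</p>
<pre><code>print("hello")</code></pre>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

for t in soup.find_all(["h1", "p", "pre", "code"]):
    # t.name is the tag's own name; t.parent.name is the enclosing tag's name.
    print(t.name, "inside", t.parent.name)

# For each <code> tag, record (own name, parent name).
parent_names = [(t.name, t.parent.name) for t in soup.find_all("code")]
```

Depending on whether you want to classify a tag by what it is or by where it sits, either t.name or t.parent.name may be the right field to test.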

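【Comments】:

It isn't getting results from code and pre. For example, if you look at the third page URL, there is a code snippet in a pre tag that should end up in code_outputs, but it isn't fetched.
You only opened the first page.

【Solution 2】:

You can try this: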

import requests
from bs4 import BeautifulSoup

required_tags = ["h1", "h2", "h3", "h4", "h5", "pre", "code", "p"]
text_outputs = []
code_outputs = []
pages = [
        "https://bugs.launchpad.net/bugs/1803780",
        "https://bugs.launchpad.net/bugs/1780224",
        "https://docs.openstack.org/keystone/pike/_modules/keystone/assignment/core.html",
        "https://openstack-news.blogspot.com/2018/11/bug-1803780-confusing-circular.html",
        "https://www.suse.com/documentation/suse-openstack-cloud-9/doc-cloud-upstream-user/user"
        "/html/keystone/_modules/keystone/assignment/core.html"
    ]


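# verify=False disables TLS certificate verification; drop it unless the
# site's certificate fails to validate.
page = requests.get(pages[2], verify=False)
html_text = BeautifulSoup(page.text, 'html.parser')

# Collect the text of each tag type, routing pre/code to code_outputs
# and everything else to text_outputs.
for tag in required_tags:
    data = [element.get_text() for element in html_text.find_all(tag)]
    if tag in ("code", "pre"):
        code_outputs += data
    else:
        text_outputs += data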
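
One thing to watch when collecting both pre and code: when a code tag is nested inside a pre tag, the same text is matched twice. The sketch below shows the effect on invented markup and one possible way to skip the nested copy; the filtering rule is an assumption about what is wanted, not part of the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <code> wrapped in <pre>, one inline <code>.
html = "<pre><code>a = 1</code></pre><p>Use <code>a</code> here.</p>"
soup = BeautifulSoup(html, "html.parser")

# Naive collection: the nested snippet appears once for <pre> and once for <code>.
naive = [t.get_text() for t in soup.find_all(["pre", "code"])]

# Skip any <code> that already lives inside a <pre>.
deduped = [
    t.get_text()
    for t in soup.find_all(["pre", "code"])
    if not (t.name == "code" and t.find_parent("pre") is not None)
]

print(naive)
print(deduped)
```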

【Comments】:

【Solution 3】:

You aren't getting the expected data because of the extra, unnecessary inner for loop:

 for e in elements:
     if e == 'code' or e == 'pre':

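Look at the condition above: you iterate over the list of parent tags inside the loop over the matched tags, and you compare a Tag object with a string. You already have the pre tag data in the text list.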

# Reuses required_tags, pages, code_outputs and text_outputs from the question.
for page in pages:
    res = requests.get(page)
    html_text = BeautifulSoup(res.text, 'html.parser')
    text = html_text.find_all(required_tags)
    for t in text:
        # Compare the tag's name (a string), not the Tag object itself.
        if t.name == 'code' or t.name == 'pre':
            print("===if===")
            code_outputs.append(t.get_text())
        else:
            print("===else===")
            text_outputs.append(t.get_text())

print(code_outputs)
print(text_outputs)
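
The Tag-versus-string comparison described above can be checked offline. This is a small sketch on invented markup; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, parsed locally so no network access is needed.
soup = BeautifulSoup("<pre>x = 1</pre><p>Some text</p>", "html.parser")
pre_tag = soup.find("pre")

# A Tag object never equals a plain string, so this branch is never taken.
tag_equals_string = (pre_tag == "pre")

# The tag's name is a string, so this comparison behaves as intended.
name_equals_string = (pre_tag.name == "pre")

code_outputs, text_outputs = [], []
for t in soup.find_all(["pre", "p"]):
    if t.name in ("pre", "code"):
        code_outputs.append(t.get_text())
    else:
        text_outputs.append(t.get_text())

print(tag_equals_string, name_equals_string)
print(code_outputs, text_outputs)
```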

Update:

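json_data = []
for page in pages:
    res = requests.get(page)
    html_text = BeautifulSoup(res.text, 'html.parser')
    text = html_text.find_all(required_tags)
    # Start fresh lists for every page so results are not mixed across URLs.
    code_outputs = []
    text_outputs = []
    for t in text:
        if t.name == 'code' or t.name == 'pre':
            code_outputs.append(t.get_text())
        else:
            text_outputs.append(t.get_text())

    # Store the matched tags as strings so the record can be serialized.
    data = {page: {"html": [str(t) for t in text], "code_outputs": code_outputs, "text_outputs": text_outputs}}
    json_data.append(data)

print(json_data)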
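
If the goal is a list that pairs each link with its own text and code, one possible shape is a record per page that can be passed to json.dumps. The sketch below is only an illustration: build_record, the "url" key and the example.invalid address are hypothetical names, and the HTML is passed in as a string so the example runs without network access.

```python
import json
from bs4 import BeautifulSoup

REQUIRED_TAGS = ["h1", "h2", "h3", "h4", "h5", "pre", "code", "p"]
CODE_TAGS = {"pre", "code"}


def build_record(url, html):
    """Return one JSON-serializable record for a page (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    record = {"url": url, "code_outputs": [], "text_outputs": []}
    for t in soup.find_all(REQUIRED_TAGS):
        # Route each tag's text by the tag's own name.
        key = "code_outputs" if t.name in CODE_TAGS else "text_outputs"
        record[key].append(t.get_text())
    return record


# Placeholder URL and markup, not one of the pages from the question.
sample = build_record(
    "https://example.invalid/page",
    "<h2>Bug</h2><p>Steps</p><pre>trace</pre>",
)
as_json = json.dumps([sample], indent=2)
print(as_json)
```

With real pages, the html argument would be the res.text of each request, and the records would be collected into one list.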

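【Comments】:

OK, how can I add the link along with the results? The final output should be a list containing each link with its text and code.
That is, the link and its related information should be added separately: I have to add the link, then its specific text or code.
Can we format the code snippets in the response? It looks weird!
@Abdul Rehman What do you want: to scrape the site's text, or just to store the filtered HTML?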