【Question title】: Using BeautifulSoup to get tags and text
【Posted】: 2021-07-22 13:49:49
【Question】:

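I have been trying this for a while now, but I am stuck. My website has the following structure (unfortunately I only have a screenshot; for some reason I could not copy and paste the code...)

Edit: sorry, of course, here is one of the URLs: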

https://www.energy.gov/eere/buildings/downloads/new-iglu-high-efficiency-vacuum-insulated-panel-modular-building-system

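I find the div class="field field etc.... I want to store everything inside "strong" or "h4" as dataframe column names (I got that part) together with the corresponding text. I partly succeeded: I only lose the content of the second <p> tag under "Project Objective", and I am completely lost on the text between "Partners" and the <br> tags. This is what I did: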

content = soup.find_all('div', class_='field field--text_default field--body')

# For the headers:
headers = content[0].find_all(["strong","h4"])
col_names = []
for header in headers:
    col_names.append(header.text)

# and for the content:
con = []
divs = content[0].findAll(["strong", "h4"])
for el in divs:
    con.append(el.nextSibling)
con = [el.string for el in con if el is not None]
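
One likely reason the attempt above loses text is that `next_sibling` returns only the single node directly after the header tag, so anything after a following `<br>` is never visited. The sketch below shows the difference on a small hypothetical fragment (the HTML string is made up to imitate the page layout, not copied from the real site).

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the page layout; not the real site's HTML.
html = "<p><strong>Partners:</strong> -- A<br>-- B<br>-- C</p>"

soup = BeautifulSoup(html, "html.parser")
strong = soup.find("strong")

# next_sibling is only the one node directly after <strong>
first_only = strong.next_sibling.strip()

# next_siblings walks every following sibling; keep text nodes, skip <br> tags
all_parts = [
    s.strip()
    for s in strong.next_siblings
    if isinstance(s, str) and s.strip()
]

print(first_only)
print(all_parts)
```

Text nodes in BeautifulSoup are `NavigableString` objects, which subclass `str`, so the `isinstance(s, str)` check is enough to filter out the `<br>` tags here.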

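【Comments】:

  • It is better to post the URL rather than an image or HTML; it helps to see how the page's code actually works.
  • Maybe get each <p>, then use a for-loop over its children and check whether a child has the tag name strong. For strong, keep it in a variable head; keep everything else in text; when you reach the next strong, put the previous head, text into some list. Don't try to get the headers and the content separately.
  • Yes, I forgot to mention that I also need "h4" as column names.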

Tags: python beautifulsoup tags screen-scraping


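【Solution 1】:

This is a modification of @Sebastian's version.

I keep all data in one list as pairs (header, text), but I don't add to this list directly.

When I find a header, I keep it in a separate variable header. When I find text, I also keep it in a separate list text. Only when I find the next header do I add the previous header, text pair to data. At the end I have to add the last header, text pair to data. I also use header = None to detect whether the first header has been found yet, so that I don't add an empty header, text pair.

Because I keep all text as a list, I can decide later whether to display it on one line or on separate lines (e.g. the -- entries in Partners).

I also added code for <a> to get the email address. I am considering adding code for <br>.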

import requests
import bs4
from bs4 import BeautifulSoup as BS

url = 'https://www.energy.gov/eere/buildings/downloads/new-iglu-high-efficiency-vacuum-insulated-panel-modular-building-system'

r = requests.get(url)

soup = BS(r.text, 'html.parser')

content = soup.find_all('div', class_='field field--text_default field--body')
#print(content)

data = []   # list for pairs `(header, text)`

header = None  # last found `header`
text = []      # all text found after last `header`


all_tags = content[0].find_all(["p","h4"])

for tag in all_tags:

    for child in tag.children:
        if isinstance(child, bs4.element.Tag):
            if child.name == "strong":  # `in "strong"` would also match substrings such as "s"
                # put previous `header + text`
                if header is not None:  # don't before first header
                    data.append( [header, text] )

                # remember new `header` and make place for new text
                header = child.get_text().strip(": ")
                text = []

            #if child.name == "br":
            #    text.append('\n')

            if child.name == "a":
                text.append(child.get_text().strip())

        if isinstance(child, bs4.element.NavigableString):
            if child in ("Project Objective", "Project Impact", "Contacts"):
                # put previous `header + text`
                if header is not None:  # don't before first header
                    data.append( [header, text] )

                # remember new `header` and make place for new text
                header = child.strip()
                text = []
            else:
                # remember `text`
                text.append(child.strip())

# add last `header + text`
if header is not None:  # don't before first header
    data.append( [header, text] )

# --- display ---

print('len(data):', len(data), '\n')

for header, text in data:
    print('header:', header)
    print('--- text ---')
    #print(' '.join(text).strip('\n'))
    if header == 'Partners':
        print('\n'.join(text))
    else:        
        print(' '.join(text))
    print('====================================')

Result:

Only the header Contacts is empty, because its elements appear under the headers DOE Technology Manager and Lead Performer.

len(data): 11 

header: Lead Performer
--- text ---
Cold Climate Housing Research Center – Fairbanks, AK
====================================
header: Partners
--- text ---
-- Panasonic Corp. – Newark, NJ
-- Taġiuġmiullu Nunamiullu Housing Authority – Utqiagvik, AK
-- National Renewable Energy Laboratory, Golden, CO
====================================
header: DOE Total Funding
--- text ---
$375,161
====================================
header: Cost Share
--- text ---
$95,293
====================================
header: Project Term
--- text ---
July 2020 – May 2022
====================================
header: Funding Type
--- text ---
Advanced Building Construction FOA Award
====================================
header: Project Objective
--- text ---
Vacuum insulated panels (VIPs) are poised to transform the building industry by making homes more energy efficient with little additional upfront cost. However, they are currently uncommon due to their inherent fragility. As the R-value relies on the vacuum inside the panel, any damage to the panel negates the insulation value of the system. With today’s residential construction methods and fastener technology, it is nearly impossible to avoid damaging panels during assembly or over the life of the home. These issues make VIPs incompatible with current construction techniques. To overcome these issues and capitalize on the high R-value of VIPs, the project team will develop a new building system with durable assemblies that can perform in Arctic conditions. The long-term plan is to make the system a mass-market building platform that can address the need for affordable, high-efficiency housing across the nation. This starts with a proof of concept that will be built and tested at the Cold Climate Housing Research Center in Fairbanks, Alaska. Developing this concept in the country’s only Arctic state, which has the coldest temperatures and highest energy costs in the U.S., will ensure its durability and performance in other climates.
====================================
header: Project Impact
--- text ---
The energy-savings payback of this system is estimated to be eight years with applicability and potential benefit in every U.S. climate zone. For remote regions such as central Alaska, the payback would be even shorter as the cost of energy exceeds the assumed retail energy cost. Considering the building envelope alone, this system is expected to achieve a reduction in heating/cooling energy of at least 48% and an annual savings of 1,637 TBtu if implemented nationwide.
====================================
header: Contacts
--- text ---

====================================
header: DOE Technology Manager
--- text ---
Marc LaFrance, Marc.Lafrance@ee.doe.gov 
====================================
header: Lead Performer
--- text ---
Bruno Grunau, Cold Climate Housing Research Center
====================================
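
The accumulation scheme described in this answer can also be tried offline. The following is a minimal sketch under stated assumptions: `SAMPLE` is hypothetical markup loosely modelled on the page, `collect_pairs` is a name introduced here, every `<h4>` is treated as a header by tag name (instead of matching fixed strings as the answer does), and the `<br>` branch is only one guess at how the handling the answer mentions might look.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical markup, loosely modelled on the page; not the real HTML.
SAMPLE = """
<div class="body">
  <p><strong>Partners:</strong> -- A<br>-- B</p>
  <h4>Project Objective</h4>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
  <p><strong>Contact:</strong> Jane Doe, <a href="mailto:jane@example.com">jane@example.com</a></p>
</div>
"""


def collect_pairs(html):
    """Return a list of (header, text_list) pairs in document order."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    header, text = None, []

    def start(new_header):
        # flush the previous pair (if any) and begin collecting for the new header
        nonlocal header, text
        if header is not None:
            pairs.append((header, text))
        header, text = new_header, []

    for tag in soup.find_all(["p", "h4"]):
        if tag.name == "h4":  # assumption: every <h4> is a header
            start(tag.get_text(strip=True))
            continue
        for child in tag.children:
            if isinstance(child, NavigableString):
                piece = child.strip()
                if piece:  # skip whitespace-only nodes
                    text.append(piece)
            elif isinstance(child, Tag):
                if child.name == "strong":
                    start(child.get_text().strip(": "))
                elif child.name == "br":
                    text.append("\n")  # keep the line break as its own item
                elif child.name == "a":
                    text.append(child.get_text(strip=True))

    if header is not None:  # flush the last pair
        pairs.append((header, text))
    return pairs


for header, text in collect_pairs(SAMPLE):
    print(header, "->", " ".join(text))
```

Because consecutive `<p>` elements keep appending to the same `text` list until the next header appears, both paragraphs under "Project Objective" end up in one pair.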

【Comments】:

    【Solution 2】:

    Following furas and working with children, I arrived at the following as a partial solution:

    # assumes `import bs4` and `content` from the code in the question
    headers, inhalt = [], []
    tags = content[0].find_all(["p", "h4"])
    for tag in tags:
        for child in tag.children:
            if isinstance(child, bs4.element.Tag):
                if child.name == "strong":
                    headers.append(child.get_text().strip(": "))
            elif isinstance(child, bs4.element.NavigableString):
                if child in ("Project Objective", "Project Impact", "Contacts"):
                    headers.append(child)
                else:
                    inhalt.append(child)
    

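    Unfortunately, I have to put three children under one header in one case and two children in another. The three always start with "--", so that should not be too hard, but how do I select the two separate elements that go into one cell?

    【Comments】:

    • First, you should keep header and inhalt together as pairs in one list. In separate variables you should keep only the last header and all the text found since that header. When you find a new header, you append the last header and all the text since it as a pair (header, text). This way you should get all the text between headers.
    • Or you could first parse the data in p on its own, and then the data in h4, p on its own. That may be simpler.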
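
Once the data is available as (header, text) pairs, one way to get to the single table row the question asks for is to build a plain dict first. This is only a sketch: `pairs_to_row` is a hypothetical helper, and the numeric suffix for repeated headers is an assumption made here because "Lead Performer" appears twice in the printed result of Solution 1.

```python
def pairs_to_row(pairs, sep=" "):
    """Turn (header, text_list) pairs into one dict usable as a table row."""
    row = {}
    seen = {}
    for header, text in pairs:
        seen[header] = seen.get(header, 0) + 1
        # a repeated header gets a numeric suffix so it does not overwrite the first
        key = header if seen[header] == 1 else f"{header} ({seen[header]})"
        row[key] = sep.join(text).strip()
    return row


# Values taken from the output shown in Solution 1
pairs = [
    ("Lead Performer", ["Cold Climate Housing Research Center – Fairbanks, AK"]),
    ("Cost Share", ["$95,293"]),
    ("Lead Performer", ["Bruno Grunau, Cold Climate Housing Research Center"]),
]

row = pairs_to_row(pairs)
print(list(row))
```

With pandas installed, `pandas.DataFrame([row])` would then give a one-row frame whose column names are the headers; collecting one such dict per URL gives one row per page.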