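This is a modification of @Sebastian's version.
I keep all data in a single list as pairs (header, text), but I don't append to this list directly.
When I find a header, I keep it in a separate variable header. When I find text, I also keep it in a separate list text. Only when I find the next header do I append the previous header, text pair to data. At the end I have to append the last header, text pair to data. I also use header = None to recognize whether I have found the first header yet, so that I don't add an empty header, text pair.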
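The "append the previous pair only when the next header shows up" idea can be shown without any HTML. The sketch below is only an illustration: the function name group_pairs and the ('header', ...) / ('text', ...) token format are made up for this example and are not part of the scraper.

```python
def group_pairs(tokens):
    """Group a flat stream of ('header', s) / ('text', s) tokens
    into a list of [header, [text, ...]] pairs."""
    data = []      # finished pairs
    header = None  # last header seen; None means "no header yet"
    text = []      # text collected since the last header

    for kind, value in tokens:
        if kind == 'header':
            # flush the previous pair, but not before the first header
            if header is not None:
                data.append([header, text])
            header = value
            text = []  # anything collected before the first header is dropped here
        else:
            text.append(value)

    # the last pair is never followed by a header, so flush it by hand
    if header is not None:
        data.append([header, text])
    return data


# hypothetical token stream, loosely modelled on the page
tokens = [
    ('header', 'Cost Share'),
    ('text', '$95,293'),
    ('header', 'Contacts'),
    ('header', 'DOE Technology Manager'),
    ('text', 'Marc LaFrance,'),
    ('text', 'Marc.Lafrance@ee.doe.gov'),
]
for header, text in group_pairs(tokens):
    print(header, text)
```

A header that is followed directly by another header (Contacts here) ends up with an empty text list.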
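Because I keep all text as a list, I can decide later whether to display it on one line or on separate lines (like the -- entries in Partners).
I also added code for <a> to get the email address. I am considering adding code for <br>.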
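One possible way to handle <br> is to treat it as a line break and start a new line of text each time it appears. This is only a sketch on a made-up HTML fragment, not tested against the real page; the helper name split_on_br is hypothetical.

```python
import bs4
from bs4 import BeautifulSoup


def split_on_br(html):
    """Return (header, lines) for a single <p>, starting a new line at every <br>."""
    soup = BeautifulSoup(html, 'html.parser')
    header = None
    lines = [[]]  # each line is a list of text fragments

    for child in soup.p.children:
        if isinstance(child, bs4.element.Tag):
            if child.name == 'strong':
                header = child.get_text().strip(': ')
            elif child.name == 'br':
                lines.append([])  # <br> starts a new line
            elif child.name == 'a':
                lines[-1].append(child.get_text().strip())
        elif isinstance(child, bs4.element.NavigableString):
            fragment = child.strip()
            if fragment:  # skip whitespace-only strings
                lines[-1].append(fragment)

    # join the fragments of every non-empty line
    return header, [' '.join(parts) for parts in lines if parts]


# made-up fragment for illustration
html = '<p><strong>Partners:</strong> -- A<br>-- B<br>-- C</p>'
header, text = split_on_br(html)
print(header)
print('\n'.join(text))
```

With this approach every element of text is one displayed line, so the later display step can join with '\n' for any header.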
import requests
import bs4
from bs4 import BeautifulSoup as BS

url = 'https://www.energy.gov/eere/buildings/downloads/new-iglu-high-efficiency-vacuum-insulated-panel-modular-building-system'

r = requests.get(url)
soup = BS(r.text, 'html.parser')

content = soup.find_all('div', class_='field field--text_default field--body')
#print(content)

data = []      # list for pairs `(header, text)`
header = None  # last found `header`
text = []      # all text found after last `header`

all_tags = content[0].find_all(["p", "h4"])

for tag in all_tags:
    for child in tag.children:

        if isinstance(child, bs4.element.Tag):
            # compare with `==`; `in "strong"` would be a substring test
            if child.name == "strong":
                # put previous `header + text`
                if header is not None:  # don't add before first header
                    data.append([header, text])
                # remember new `header` and make place for new text
                header = child.get_text().strip(": ")
                text = []
            #if child.name == "br":
            #    text.append('\n')
            if child.name == "a":
                # link text (here: the email address)
                text.append(child.get_text().strip())

        if isinstance(child, bs4.element.NavigableString):
            # these headers are plain strings inside <h4>, not <strong>
            if child in ("Project Objective", "Project Impact", "Contacts"):
                # put previous `header + text`
                if header is not None:  # don't add before first header
                    data.append([header, text])
                # remember new `header` and make place for new text
                header = child.strip()
                text = []
            else:
                # remember `text`
                text.append(child.strip())

# add last `header + text`
if header is not None:  # don't add if no header was found at all
    data.append([header, text])

# --- display ---

print('len(data):', len(data), '\n')

for header, text in data:
    print('header:', header)
    print('--- text ---')
    #print(' '.join(text).strip('\n'))
    if header == 'Partners':
        print('\n'.join(text))  # one partner per line
    else:
        print(' '.join(text))
    print('====================================')
Result:
Only the Contacts header is empty, because its elements end up under the headers DOE Technology Manager and Lead Performer.
len(data): 11
header: Lead Performer
--- text ---
Cold Climate Housing Research Center – Fairbanks, AK
====================================
header: Partners
--- text ---
-- Panasonic Corp. – Newark, NJ
-- Taġiuġmiullu Nunamiullu Housing Authority – Utqiagvik, AK
-- National Renewable Energy Laboratory, Golden, CO
====================================
header: DOE Total Funding
--- text ---
$375,161
====================================
header: Cost Share
--- text ---
$95,293
====================================
header: Project Term
--- text ---
July 2020 – May 2022
====================================
header: Funding Type
--- text ---
Advanced Building Construction FOA Award
====================================
header: Project Objective
--- text ---
Vacuum insulated panels (VIPs) are poised to transform the building industry by making homes more energy efficient with little additional upfront cost. However, they are currently uncommon due to their inherent fragility. As the R-value relies on the vacuum inside the panel, any damage to the panel negates the insulation value of the system. With today’s residential construction methods and fastener technology, it is nearly impossible to avoid damaging panels during assembly or over the life of the home. These issues make VIPs incompatible with current construction techniques. To overcome these issues and capitalize on the high R-value of VIPs, the project team will develop a new building system with durable assemblies that can perform in Arctic conditions. The long-term plan is to make the system a mass-market building platform that can address the need for affordable, high-efficiency housing across the nation. This starts with a proof of concept that will be built and tested at the Cold Climate Housing Research Center in Fairbanks, Alaska. Developing this concept in the country’s only Arctic state, which has the coldest temperatures and highest energy costs in the U.S., will ensure its durability and performance in other climates.
====================================
header: Project Impact
--- text ---
The energy-savings payback of this system is estimated to be eight years with applicability and potential benefit in every U.S. climate zone. For remote regions such as central Alaska, the payback would be even shorter as the cost of energy exceeds the assumed retail energy cost. Considering the building envelope alone, this system is expected to achieve a reduction in heating/cooling energy of at least 48% and an annual savings of 1,637 TBtu if implemented nationwide.
====================================
header: Contacts
--- text ---
====================================
header: DOE Technology Manager
--- text ---
Marc LaFrance, Marc.Lafrance@ee.doe.gov
====================================
header: Lead Performer
--- text ---
Bruno Grunau, Cold Climate Housing Research Center
====================================
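If the empty Contacts entry is not wanted, it could be filtered out after parsing. A minimal sketch, assuming data has the same [header, text] shape as in the code above; the sample values are copied from the output and the function name drop_empty is made up for illustration.

```python
def drop_empty(pairs):
    """Keep only pairs whose text list contains at least one non-blank string."""
    return [[header, text] for header, text in pairs
            if any(part.strip() for part in text)]


# small sample in the same shape as `data`
data = [
    ['Contacts', []],
    ['DOE Technology Manager', ['Marc LaFrance,', 'Marc.Lafrance@ee.doe.gov']],
    ['Lead Performer', ['Bruno Grunau, Cold Climate Housing Research Center']],
]

for header, text in drop_empty(data):
    print(header, '->', ' '.join(text))
```

Note that Lead Performer appears twice in the full output, so converting data to a dict keyed by header would silently lose one of the two entries; a list of pairs avoids that.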