【Question Title】: How can I get the contents between two tags in an html page using Beautiful Soup?
【Posted】: 2020-10-17 12:39:46
【Question Description】:

I am trying to extract the text of the Risk Factors section of this 10-K filing from the SEC's EDGAR database: https://www.sec.gov/Archives/edgar/data/101830/000010183019000022/sprintcorp201810-k.htm

As you can see, I have managed to locate the headers for Risk Factors (the section I want all the text from) and Unresolved Staff Comments (the section that follows Risk Factors), but I cannot work out how to identify/scrape all the text between those two headers (the body of the Risk Factors section).

As you can see here, I have tried the `next_sibling` approach and a few other suggestions from SO, but I am still doing something wrong.

Code:


import requests
import bs4 as bs

file = requests.get('https://www.sec.gov/Archives/edgar/data/101830/000010183019000022/sprintcorp201810-k.htm')
soup = bs.BeautifulSoup(file.content, 'html.parser')
risk_factors_header = soup.find_all("a", text="Risk Factors")[0]
staff_comments_header = soup.find_all("a", text="Unresolved Staff Comments")[0]
risk_factors_text = risk_factors_header.next_sibling

print(risk_factors_text.contents)

Excerpt of the desired output (all the text in the Risk Factors section):

In addition to the other information contained in this Annual Report on Form 10-K, the following risk factors should be considered carefully in evaluating us. Our business, financial condition, liquidity or results of operations could be materially adversely affected by any of these risks.
Risks Relating to the Merger Transactions
The closing of the Merger Transactions is subject to many conditions, including the receipt of approvals from various governmental entities, which may not approve the Merger Transactions, may delay the approvals for, or may impose conditions or restrictions on, jeopardize or delay completion of, or reduce the anticipated benefits of, the Merger Transactions, and if these conditions are not satisfied or waived, the Merger Transactions will not be completed.
The completion of the Merger Transactions is subject to a number of conditions, including, among others, obtaining certain governmental authorizations, consents, orders or other approvals and the absence of any injunction prohibiting the Merger Transactions or any legal requ........

【Discussion】:

    Tags: python web-scraping beautifulsoup


    【Solution 1】:

    A few issues:

    • You are selecting the link from the table of contents rather than the section header: the header is not an `a` tag, only a `font` tag (you can always inspect these details in your browser). However, if you try `soup.find_all("font", text="Risk Factors")` you will get 2 results, because the TOC link also contains a `font` tag, so you need to pick the second one: `soup.find_all("font", text="Risk Factors")[1]`
    • Similar issue with the second header, but this time with an interesting twist: the header has an "invisible" space before the closing tag, while the TOC link does not, so you have to select it like this: `soup.find_all("font", text="Unresolved Staff Comments ")[0]`
    • Another problem: the "text in between" is not made of siblings of the elements we selected, but of siblings of an ancestor of those elements. If you inspect the page source you will see the header is wrapped in a `div`, inside a table cell (`td`), inside a table row (`tr`), inside a `table`, so we need to go up 4 parent levels: `risk_factors_header.parent.parent.parent.parent`
    • Also, there is more than one sibling you are interested in, so it is best to use `next_siblings` and iterate over all of them.
    • With all that in place, you can use the second header to break out of the iteration.
    • Since you only want the text (ignoring all the html tags), you can use `get_text()` instead of `contents`.

    OK, putting it all together:

    import requests
    import bs4 as bs

    file = requests.get('https://www.sec.gov/Archives/edgar/data/101830/000010183019000022/sprintcorp201810-k.htm')
    soup = bs.BeautifulSoup(file.content, 'html.parser')
    risk_factors_header = soup.find_all("font", text="Risk Factors")[1]
    staff_comments_header = soup.find_all("font", text="Unresolved Staff Comments ")[0]

    for paragraph in risk_factors_header.parent.parent.parent.parent.next_siblings:
        if paragraph == staff_comments_header.parent.parent.parent.parent:
            break

        print(paragraph.get_text())
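Since exact-text matching is brittle here (the trailing "invisible" space, the duplicate TOC entry), a whitespace-tolerant regex plus `find_parent('table')` is a more forgiving variant of the same idea. A minimal sketch on an invented snippet that mimics the filing's nesting (the HTML below is not from the actual filing):

```python
import re
import bs4 as bs

# Invented snippet mimicking the filing's structure: each header is a
# font tag buried inside a table; section text lives in sibling divs.
html = """
<table><tr><td><div><font>Risk Factors</font></div></td></tr></table>
<div>First risk paragraph.</div>
<div>Second risk paragraph.</div>
<table><tr><td><div><font>Unresolved Staff Comments </font></div></td></tr></table>
<div>Not part of the section.</div>
"""

soup = bs.BeautifulSoup(html, 'html.parser')

# Whitespace-tolerant regexes sidestep the trailing-space quirk
start = soup.find('font', text=re.compile(r'^\s*Risk Factors\s*$'))
stop = soup.find('font', text=re.compile(r'^\s*Unresolved Staff Comments\s*$'))

# find_parent('table') replaces the fragile .parent.parent.parent.parent
stop_table = stop.find_parent('table')
section = []
for sib in start.find_parent('table').next_siblings:
    if sib is stop_table:
        break
    if isinstance(sib, bs.element.Tag):  # skip whitespace-only strings
        section.append(sib.get_text())

print(section)  # ['First risk paragraph.', 'Second risk paragraph.']
```

On the real filing the same pattern applies, only with the real URL fetched via requests as in the answer above.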
    

    【Discussion】:

      【Solution 2】:

      I am going to take a completely different approach from the other answers here, because you are dealing with EDGAR filings, which are dreadful as a general matter and especially dreadful when it comes to html (and, if you are unlucky enough to have to deal with it, xbrl).

      So to extract the Risk Factors section I use the approach below. It relies on the fact that Risk Factors is always Item 1A and is always (at least in my experience so far) immediately followed by Item 1B, Unresolved Staff Comments, even if, as in this case, Item 1B is "None".

      filing = ''
      for f in soup.select('font'):
          if f.text and f.text != "Table of Contents":
              filing += f.text + " \n"
      print(filing.split('Item 1B')[0].split('Item 1A')[-1])
      

      You lose most of the formatting and, as always, some cleanup will be needed anyway, but it gets close enough, in most cases.

      Note that this is EDGAR: sooner or later you will run into another filing where the text is not in <font> but in some other tag, so you will have to adapt...
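The split-on-item-labels idea can be exercised offline on an invented snippet (the tags and labels below are illustrative, not taken from the real filing):

```python
import bs4 as bs

# Invented snippet: flat font tags, with the item labels as plain text
html = """
<font>Table of Contents</font>
<font>Item 1A. Risk Factors</font>
<font>Risk paragraph one.</font>
<font>Risk paragraph two.</font>
<font>Item 1B. Unresolved Staff Comments</font>
<font>None.</font>
"""

soup = bs.BeautifulSoup(html, 'html.parser')

filing = ''
for f in soup.select('font'):
    if f.text and f.text != "Table of Contents":
        filing += f.text + " \n"

# Keep what falls after the last "Item 1A" and before the first "Item 1B"
section = filing.split('Item 1B')[0].split('Item 1A')[-1]
print(section)
```

The double split is what makes this robust to the TOC: the TOC's "Item 1A" occurrence is discarded by taking the last split piece, and everything from "Item 1B" onward is discarded by taking the first.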

      【Discussion】:

        【Solution 3】:

        An alternative solution. You can use `.find_previous_sibling()` to check whether you are inside the region you want:

        import requests
        from bs4 import BeautifulSoup
        
        
        url = 'https://www.sec.gov/Archives/edgar/data/101830/000010183019000022/sprintcorp201810-k.htm#s8925A97DDFA55204808914F6529AC721'
        soup = BeautifulSoup(requests.get(url).content, 'lxml')
        
        out = []
        for tag in soup.find('text').find_all(recursive=False):
            prev = tag.find_previous_sibling(lambda t: t.name == 'table' and t.text.startswith('Item'))
            if prev and prev.text.startswith('Item 1A.') and not tag.text.startswith('Item 1B'):
                out.append(tag.text)
        
        # print the section:
        print('\n'.join(out))
        

        Prints:

        In addition to the other information contained in this Annual Report on Form 10-K, the following risk factors should be considered carefully in evaluating us. Our business, financial condition, liquidity or results of operations could be materially adversely affected by any of these risks.
        Risks Relating to the Merger Transactions
        
        ...
        
        
        agreed to implement certain measures to protect national security, certain of which may materially and adversely affect our operating results due to increasing the cost of compliance with security measures, and limiting our control over certain U.S. facilities, contracts, personnel, vendor selection, and operations. If we fail to comply with our obligations under the NSA or other agreements, our ability to operate our business may be adversely affected.
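The `.find_previous_sibling()` gate used above can be seen in isolation on an invented miniature document (the markers and tags below are made up to mirror the filing's layout):

```python
from bs4 import BeautifulSoup

# Invented miniature: section headers are tables whose text starts
# with "Item", and section bodies are sibling <p> tags.
html = """
<text>
<table>Item 1A. Risk Factors</table>
<p>Inside the section.</p>
<p>Still inside.</p>
<table>Item 1B. Unresolved Staff Comments</table>
<p>Outside the section.</p>
</text>
"""

soup = BeautifulSoup(html, 'html.parser')

out = []
for tag in soup.find('text').find_all(recursive=False):
    # The nearest preceding "Item ..." table tells us which section this tag is in
    prev = tag.find_previous_sibling(
        lambda t: t.name == 'table' and t.text.startswith('Item'))
    if prev and prev.text.startswith('Item 1A.') and not tag.text.startswith('Item 1B'):
        out.append(tag.text)

print(out)  # ['Inside the section.', 'Still inside.']
```

Each top-level tag is kept only when its closest preceding header is Item 1A, so the stop condition is implicit rather than a `break`.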
        

        【Discussion】:

          【Solution 4】:

          Rather ugly, but you can first remove the page numbers and table-of-contents links, then use filtering to exclude the stop header and its subsequent siblings from the target header and its subsequent siblings. Requires bs4 4.7.1+.

          for unwanted in soup.select('a:contains("Table of Contents"), div:has(+[style="page-break-after:always"])'):
              unwanted.decompose() #remove table of contents hyperlinks and page numbers
          
          
          selector = ','.join(['table:has(tr:nth-of-type(2):has(font:contains("Risk Factors")))'
                              ,'table:has(tr:nth-of-type(2):has(font:contains("Risk Factors")))' + \
                               ' ~ *:not(table:has(tr:nth-of-type(2):has(font:contains("Unresolved Staff Comments"))), ' + \
                               'table:has(tr:nth-of-type(2):has(font:contains("Unresolved Staff Comments"))) ~ *)' 
                     ])
          
          text = '\n'.join([i.text for i in soup.select(selector)])
          print(text, end='\n')
          

          Using variables may make the code easier to understand:

          for unwanted in soup.select('a:contains("Table of Contents"), div:has(+[style="page-break-after:always"])'):
              unwanted.decompose() #remove table of contents hyperlinks and page numbers
          
          start_header = 'table:has(tr:nth-of-type(2):has(font:contains("Risk Factors")))'
          stop_header = 'table:has(tr:nth-of-type(2):has(font:contains("Unresolved Staff Comments")))'
          
          selector = ','.join([start_header,start_header + f' ~ *:not({stop_header}, {stop_header} ~ *)'])
          
          text = '\n'.join([i.text for i in soup.select(selector)])
          print(text, end='\n')
          

          You could, of course, instead loop over the siblings from the target header until you hit the stop header.
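The sibling-combinator filtering can be tried on an invented snippet (headers simplified to `h2` instead of the filing's nested tables; needs bs4 4.7.1+ / soupsieve):

```python
import bs4 as bs

# Invented snippet: simplified flat structure with h2 section headers
html = """
<div>
<h2>Risk Factors</h2>
<p>First.</p>
<p>Second.</p>
<h2>Unresolved Staff Comments</h2>
<p>After.</p>
</div>
"""

soup = bs.BeautifulSoup(html, 'html.parser')

start = 'h2:contains("Risk Factors")'
stop = 'h2:contains("Unresolved Staff Comments")'
# The start header, plus every later sibling that is neither the stop
# header nor anything following it
selector = ','.join([start, start + f' ~ *:not({stop}, {stop} ~ *)'])

text = [i.text for i in soup.select(selector)]
print(text)  # ['Risk Factors', 'First.', 'Second.']
```

The `{stop} ~ *` clause inside `:not()` is what excludes everything after the stop header, not just the stop header itself; newer soupsieve releases prefer `:-soup-contains()` over `:contains()`, but both match here.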

          【Discussion】:
