Handling text data without tags using BeautifulSoup: answers

【Question title】: Handle text data without tag using BeautifulSoup
【Posted】: 2020-05-28 07:38:50
【Question description】:

B>DAY</B>, Arbitrator: Under the jurisdiction of the United States
Federal Government and the Federal Aviation Administration, the above
grievance arbitration was submitted to **Joseph L. Daly, Arbitrator**,
on August 15, 2017, at the Federal Aviation Administration South West
Regional Office Central Service Center, Fort Worth, Texas. Prior to
the arbitration hearing, the parties motions were made by the FAA
and NATCA to exclude witnesses from testifying at the arbitration
hearing. The arbitrator denied the motions by a written **decision dated
August 6, 2017**.</P>
<P>The parties filed post-hearing briefs on October 20, 2017. The
Opinion and Award was rendered on October 30, 2017.</P>

Above is the data from which I want to extract the decision date and the corresponding arbitrator name, which here is Joseph L. Daly.

My current code is:

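from bs4 import BeautifulSoup

# Read the SGML file and parse it with the built-in HTML parser.
with open("file.sgm", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'html.parser')

# Print the text of every <P> element.
s = soup.find_all('p')
for i in s:
    data = i.text
    print(data)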

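I can extract the paragraph data, but how do I now extract the corresponding values from it?

【Comments】:

Are the ** markers already present in the text, or did you add them yourself?

Tags: python-3.x web-scraping beautifulsoup data-extraction sgml

【Solution 1】: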

import re


data = """
B>DAY</B>, Arbitrator: Under the jurisdiction of the United States
Federal Government and the Federal Aviation Administration, the above
grievance arbitration was submitted to **Joseph L. Daly, Arbitrator**,
on August 15, 2017, at the Federal Aviation Administration South West
Regional Office Central Service Center, Fort Worth, Texas. Prior to
the arbitration hearing, the parties motions were made by the FAA
and NATCA to exclude witnesses from testifying at the arbitration
hearing. The arbitrator denied the motions by a written **decision dated
August 6, 2017**.</P>
<P>The parties filed post-hearing briefs on October 20, 2017. The
Opinion and Award was rendered on October 30, 2017.</P>
"""

match = re.findall(r"\*\*([^*]*)\*\*", data)

print(match)

Output:

['Joseph L. Daly, Arbitrator', 'decision dated\nAugust 6, 2017']
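The second match still carries the "decision dated" prefix and a line break. The following is a minimal sketch of a possible clean-up step, not part of the original answer. It assumes the name is always followed by ", Arbitrator", the date is always preceded by "decision dated", and month names parse under an English locale. The names `clean`, `arbitrator`, and `decision_date` are illustrative.

```python
import re
from datetime import datetime

# The two matches returned by the re.findall call in the answer above.
match = ['Joseph L. Daly, Arbitrator', 'decision dated\nAugust 6, 2017']

# Collapse line breaks and repeated whitespace into single spaces.
clean = [" ".join(m.split()) for m in match]

# Assumption: the name ends with ", Arbitrator" and the date phrase
# starts with "decision dated", as in the sample text.
arbitrator = re.sub(r",\s*Arbitrator$", "", clean[0])
date_text = re.sub(r"^decision dated\s+", "", clean[1])

# Assumption: dates look like "August 6, 2017" (English month names).
decision_date = datetime.strptime(date_text, "%B %d, %Y").date()

print(arbitrator)
print(decision_date.isoformat())
```

Parsing the date into a `date` object is optional; it mainly helps if the values are later sorted or compared.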

【Discussion】:
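If the ** markers were only added for emphasis in the question and are not in the actual file (the point raised in the comment on the question), the pattern in Solution 1 will find nothing. One possible alternative, sketched here under the assumption that the wording "submitted to NAME, Arbitrator" and "decision dated DATE" is stable across documents, is to anchor the regular expressions on those phrases instead. The abbreviated sample `text` and the variable names are illustrative.

```python
import re

# Abbreviated sample of the paragraph text, without any ** markers.
text = """the above
grievance arbitration was submitted to Joseph L. Daly, Arbitrator,
on August 15, 2017, at the Federal Aviation Administration South West
Regional Office Central Service Center, Fort Worth, Texas. The arbitrator
denied the motions by a written decision dated
August 6, 2017."""

# Flatten line breaks so a phrase split across lines still matches.
flat = " ".join(text.split())

# Assumption: the name sits between "submitted to" and ", Arbitrator".
name_match = re.search(r"submitted to (.+?), Arbitrator", flat)

# Assumption: the date follows "decision dated" in "Month D, YYYY" form.
date_match = re.search(r"decision dated ([A-Z][a-z]+ \d{1,2}, \d{4})", flat)

arbitrator = name_match.group(1) if name_match else None
decision_date = date_match.group(1) if date_match else None

print(arbitrator)
print(decision_date)
```

In the asker's loop, `text` would be the `i.text` of each paragraph; paragraphs that lack either phrase simply yield `None`.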