BeautifulSoup split by <br>

【Question title】: BeautifulSoup split by <br>
【Posted】: 2020-04-22 16:36:33
【Question description】:

I am trying to split text at its <br> tags.

I have this markup:

<div class="grseq"><p class="tigrseq"><span id="id0-I."></span>Section I: Contracting authority</p><div class="mlioccur"><span style="color:black" class="nomark"><!--Non empty span 2-->I.1)</span><span class="timark" style="font-weight:bold;color:black;">Name and addresses</span><div style="color:black" class="txtmark">Official name: WOBA mbH Oranienburg<br>Postal address: Villacher Straße 2<br>Town: Oranienburg<br>NUTS code: <span class="nutsCode" title="Oberhavel">DE40A</span><br>Postal code: 16515<br>Country: Germany<br>E-mail: <a class="ojsmailto" href="mailto:kordecki@woba.de?subject=TED">kordecki@woba.de</a><p><b>Internet address(es): </b></p><p>Main address: <a class="ojshref" href="http://www.woba.de" target="_blank">www.woba.de</a></p></div><!--//txtmark end--></div><div class="mlioccur"><span style="color:black" class="nomark"><!--Non empty span 2-->I.2)</span><span class="timark" style="font-weight:bold;color:black;">Information about joint procurement</span></div><div class="mlioccur"><span style="color:black" class="nomark"><!--Non empty span 2-->I.4)</span><span class="timark" style="font-weight:bold;color:black;">Type of the contracting authority</span><div style="color:black" class="txtmark">Other type: Wohnungswirtschaft</div><!--//txtmark end--></div><div class="mlioccur"><span style="color:black" class="nomark"><!--Non empty span 2-->I.5)</span><span class="timark" style="font-weight:bold;color:black;">Main activity</span><div style="color:black" class="txtmark">Housing and community amenities</div><!--//txtmark end--></div></div>

I want to get a list with one entry per line, like this:

['Official name: WOBA mbH Oranienburg', 'Postal address: Villacher Straße 2', ...]

This is my code:

import requests
from bs4 import BeautifulSoup

webpage = 'https://ted.europa.eu/udl?uri=TED:NOTICE:565570-2019:TEXT:EN:HTML&src=0&tabId=0#id1-I.'
webpage_response = requests.get(webpage)
soup = BeautifulSoup(webpage_response.content, 'lxml')
tags = soup.find(class_="mlioccur")
br_tags = tags.text.strip().split('\n\n')
print(br_tags)

What I get instead is a list with a single entry:

['I.1)Name and addressesOfficial name: WOBA mbH OranienburgPostal address: Villacher Straße 2Town: OranienburgNUTS code: DE40APostal code: 16515Country: GermanyE-mail: kordecki@woba.deInternet address(es): Main address: www.woba.de']

Any help would be appreciated :)

【Question comments】:

Good question. Here is the documentation for BeautifulSoup's get_text function. There are similar questions here: extract text between line breaks (e.g. <br /> tags), Extract text with line break
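
To make the get_text behaviour mentioned above concrete, here is a minimal sketch. The two-line HTML snippet is a made-up stand-in for the real page, not taken from it. It suggests why the attempt in the question yields a single string: a <br> contributes no text of its own, so without a separator the strings on either side are simply concatenated.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for one .txtmark block of the real page.
html = '<div class="txtmark">Town: Oranienburg<br>Postal code: 16515</div>'
div = BeautifulSoup(html, 'html.parser').div

# No separator: the text on both sides of <br> is glued together.
joined = div.get_text()

# With a separator, each text node stays distinct and can be split afterwards.
parts = div.get_text(separator='\n', strip=True).split('\n')

print(repr(joined))
print(parts)
```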

Tags: python-3.x text beautifulsoup split tags

【Solution 1】:

You can use the .get_text() method with the separator= parameter, then call str.split() on that separator:

txt = '''<div class="grseq"><p class="tigrseq"><span id="id0-I."></span>Section I: Contracting authority</p><div class="mlioccur"><span style="color:black" class="nomark"><!--Non empty span 2-->I.1)</span><span class="timark" style="font-weight:bold;color:black;">Name and addresses</span><div style="color:black" class="txtmark">Official name: WOBA mbH Oranienburg<br>Postal address: Villacher Straße 2<br>Town: Oranienburg<br>NUTS code: <span class="nutsCode" title="Oberhavel">DE40A</span><br>Postal code: 16515<br>Country: Germany<br>E-mail: <a class="ojsmailto" href="mailto:kordecki@woba.de?subject=TED">kordecki@woba.de</a><p><b>Internet address(es): </b></p><p>Main address: <a class="ojshref" href="http://www.woba.de" target="_blank">www.woba.de</a></p></div><!--//txtmark end--></div><div class="mlioccur"><span style="color:black" class="nomark"><!--Non empty span 2-->I.2)</span><span class="timark" style="font-weight:bold;color:black;">Information about joint procurement</span></div><div class="mlioccur"><span style="color:black" class="nomark"><!--Non empty span 2-->I.4)</span><span class="timark" style="font-weight:bold;color:black;">Type of the contracting authority</span><div style="color:black" class="txtmark">Other type: Wohnungswirtschaft</div><!--//txtmark end--></div><div class="mlioccur"><span style="color:black" class="nomark"><!--Non empty span 2-->I.5)</span><span class="timark" style="font-weight:bold;color:black;">Main activity</span><div style="color:black" class="txtmark">Housing and community amenities</div><!--//txtmark end--></div></div>'''

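from pprint import pprint

from bs4 import BeautifulSoup

soup = BeautifulSoup(txt, 'html.parser')

# Collect the text of every .txtmark block, inserting '|' between text nodes
# (so each <br> and each inline tag boundary becomes a '|').
out = []
for tag in soup.select('.txtmark'):
    out.append(tag.get_text(strip=True, separator='|'))

# Re-join labels that an inline tag cut off from their values
# (e.g. 'NUTS code:|DE40A'), then split on the remaining separators.
out = '|'.join(out).replace(':|', ': ').split('|')

pprint(out)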

Prints:

['Official name: WOBA mbH Oranienburg',
 'Postal address: Villacher Straße 2',
 'Town: Oranienburg',
 'NUTS code: DE40A',
 'Postal code: 16515',
 'Country: Germany',
 'E-mail: kordecki@woba.de',
 'Internet address(es): Main address: www.woba.de',
 'Other type: Wohnungswirtschaft',
 'Housing and community amenities']
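
As a possible alternative that splits strictly at the <br> tags, each <br> can be replaced with a newline before the text is extracted. This is only a sketch on a shortened, hypothetical excerpt of the markup from the question, not part of the original answer:

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical excerpt of the markup from the question.
html = (
    '<div class="txtmark">Official name: WOBA mbH Oranienburg<br>'
    'NUTS code: <span class="nutsCode" title="Oberhavel">DE40A</span><br>'
    'E-mail: <a href="mailto:kordecki@woba.de">kordecki@woba.de</a></div>'
)
soup = BeautifulSoup(html, 'html.parser')
block = soup.select_one('.txtmark')

# Turn every <br> into a literal newline so it survives text extraction.
for br in block.find_all('br'):
    br.replace_with('\n')

# Inline tags (<span>, <a>) add no separator, so labels stay joined to values.
lines = [line.strip() for line in block.get_text().split('\n') if line.strip()]
print(lines)
```

Because only <br> produces a break here, no ':|' repair step is needed. The trade-off is that text in block-level children that are not preceded by a <br>, such as the <p> elements in the full markup, would run straight on from the previous line and would need separate handling.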

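【Comments】:

Nice solution: a truly self-contained minimal reproducible example that makes no request and relies only on the static HTML input txt. Some comments in your code documenting the changes that solve the problem would be helpful.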