【Question Title】: BeautifulSoup: finding nested tag
【Posted】: 2020-12-04 13:02:52
【Question】:
I'm stuck on this:
<span>Alpha<span class="class_xyz">Beta</span></span>
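I'm trying to grab only the text of the first span, "Alpha" (excluding the nested second span, "Beta").
How would you do that?
I'm trying to write a function that finds all span tags without a class attribute, but something isn't working...
Thanks.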
Tags:
python
beautifulsoup
nested
tags
screen-scraping
【Solution 1】:
Here is another way to get the text of every span tag that has no class attribute:
from bs4 import BeautifulSoup
html = """
<body>
<p>Some random text</p>
<span>Alpha<span class="class_xyz">Beta</span></span>
<span>Gamma<span class="class_abc">Delta</span></span>
<span>Epsilon<span class="class_lmn">Zeta</span></span>
</body>
"""
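# Name the parser explicitly to avoid the "no parser specified" warning.
soup = BeautifulSoup(html, "html.parser")
# Remove every <span> that carries a class attribute (the nested ones).
for tag in soup.select("span[class]"):
    tag.decompose()
# The remaining spans now contain only their own text.
out = [span.text.strip() for span in soup.select("span")]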
print(out)
Output:
['Alpha', 'Gamma', 'Epsilon']
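If you would rather not modify the tree, here is a minimal sketch of the kind of function the question asks for. The helper names `spans_without_class` and `own_text` are made up for illustration. It assumes the text you want is a direct child string of each span, and it uses `html.parser` so that nothing beyond bs4 itself is needed:

```python
from bs4 import BeautifulSoup

html = """
<body>
<p>Some random text</p>
<span>Alpha<span class="class_xyz">Beta</span></span>
<span>Gamma<span class="class_abc">Delta</span></span>
<span>Epsilon<span class="class_lmn">Zeta</span></span>
</body>
"""


def spans_without_class(soup):
    """Return every <span> that has no class attribute (hypothetical helper)."""
    return soup.find_all(
        lambda tag: tag.name == "span" and not tag.has_attr("class")
    )


def own_text(tag):
    """Join only the strings that are direct children of tag (hypothetical helper)."""
    # recursive=False keeps the search from descending into nested tags,
    # so text inside child spans is left out.
    return "".join(tag.find_all(string=True, recursive=False)).strip()


soup = BeautifulSoup(html, "html.parser")
result = [own_text(span) for span in spans_without_class(soup)]
print(result)
```

Since nothing is decomposed, the nested spans are still in `soup` afterwards, which helps if you need to read them later.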
Or, if you want the whole span tags:
from bs4 import BeautifulSoup
html = """
<body>
<p>Some random text</p>
<span>Alpha<span class="class_xyz">Beta</span></span>
<span>Gamma<span class="class_abc">Delta</span></span>
<span>Epsilon<span class="class_lmn">Zeta</span></span>
</body>
"""
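# Name the parser explicitly to avoid the "no parser specified" warning.
soup = BeautifulSoup(html, "html.parser")
# Remove every <span> that carries a class attribute (the nested ones).
for tag in soup.select("span[class]"):
    tag.decompose()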
out = soup.select("span")
print(out)
Output:
[<span>Alpha</span>, <span>Gamma</span>, <span>Epsilon</span>]
【Solution 2】:
One way to handle it:
from bs4 import BeautifulSoup as bs
txt = """<doc>
<span>Alpha<span class="class_xyz">Beta</span></span>
</doc>"""
soup = bs(txt,'lxml')
target = soup.select_one('span[class]')
target.decompose()
soup.text.strip()
Output:
'Alpha'
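One caveat with the snippet above: `select_one` returns `None` when nothing matches, so calling `.decompose()` on the result raises an `AttributeError` for markup that has no nested span with a class. A rough sketch with a guard follows. The function name `outer_text` is hypothetical, and `html.parser` is used instead of `lxml` only to avoid the extra dependency:

```python
from bs4 import BeautifulSoup


def outer_text(markup):
    """Return the text of markup with the first classed <span> removed (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    nested = soup.select_one("span[class]")
    # select_one returns None if no element matches the selector,
    # so only decompose when something was actually found.
    if nested is not None:
        nested.decompose()
    return soup.get_text().strip()


print(outer_text('<span>Alpha<span class="class_xyz">Beta</span></span>'))
print(outer_text("<span>Alpha</span>"))
```

Both calls should print `Alpha`; the second one would fail without the `None` check.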