How to skip <span> with Beautiful Soup

【Question title】: How to skip <span> with Beautiful Soup
【Posted】: 2018-07-03 00:31:24
【Question】:

This is the output of my code:

<h1 class="it-ttl" id="itemTitle" itemprop="name"><span class="g-hdn">Details about   </span>item name goes here</h1>

I only want to get the item name, without the "Details about" part.

My Python code, which selects the element by its id, is:

for content in soup.select('#itemTitle'):
    print(content.text)

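【Comments】:

Tags: python python-3.x beautifulsoup

【Solution 1】:

You can use decompose(), clear(), or extract(). According to the documentation:

Tag.decompose() removes a tag from the tree, then completely destroys it and its contents.

Tag.clear() removes the contents of a tag.

PageElement.extract() removes a tag or string from the tree. It returns the tag or string that was extracted.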

from bs4 import BeautifulSoup
html = '''<h1 class="it-ttl" id="itemTitle" itemprop="name"><span class="g-hdn">Details about   </span>item name goes here</h1>'''

soup = BeautifulSoup(html, 'lxml')
for content in soup.select('#itemTitle'):
    content.span.decompose()
    print(content.text)

Output:

  item name goes here
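
To contrast the three methods quoted above, here is a minimal sketch that applies each one to the span in the question's markup. It uses the built-in 'html.parser' instead of 'lxml' so it needs no extra install, and the helper name fresh_h1 is made up for this illustration.

```python
from bs4 import BeautifulSoup

HTML = '<h1 id="itemTitle"><span class="g-hdn">Details about   </span>item name goes here</h1>'


def fresh_h1():
    # Hypothetical helper: re-parse each time so every method starts from the same tree.
    return BeautifulSoup(HTML, "html.parser").h1


# decompose(): the span is removed from the tree and destroyed.
h1 = fresh_h1()
h1.span.decompose()
decomposed_text = h1.get_text()

# extract(): the span is removed from the tree but returned, so it can still be read.
h1 = fresh_h1()
removed = h1.span.extract()
extracted_text = h1.get_text()
removed_text = removed.get_text()

# clear(): the span itself stays in the tree, but its contents are gone.
h1 = fresh_h1()
h1.span.clear()
cleared_text = h1.get_text()
cleared_markup = str(h1)

print(decomposed_text)
print(extracted_text, "| removed:", repr(removed_text))
print(cleared_text, "|", cleared_markup)
```

For this particular markup all three leave the same visible text behind. The difference is what happens to the span: destroyed, handed back, or left in place as an empty tag.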

【Comments】:

【Solution 2】:

My answer was inspired by the accepted answer.

Code:

from bs4 import BeautifulSoup, NavigableString

data = '''
<h1 class="it-ttl" id="itemTitle" itemprop="name"><span class="g-hdn">Details about   </span>item name goes here</h1>
'''

soup = BeautifulSoup(data, 'html.parser')
inner_text = [element for element in soup.h1 if isinstance(element, NavigableString)]
print(inner_text)

Output:

['item name goes here']
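
A possible variation on the list comprehension above, offered as a sketch rather than as part of the original answer: find_all(string=True, recursive=False) asks Beautiful Soup for only the text nodes that are direct children of the tag, which leaves the tree unmodified and skips text nested inside the span. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

data = '''
<h1 class="it-ttl" id="itemTitle" itemprop="name"><span class="g-hdn">Details about   </span>item name goes here</h1>
'''

soup = BeautifulSoup(data, 'html.parser')

# string=True matches text nodes; recursive=False restricts the search to
# direct children of the <h1>, so the text inside the <span> is not included.
direct_strings = soup.h1.find_all(string=True, recursive=False)

# Join the pieces and trim surrounding whitespace to get a single string.
item_name = ''.join(direct_strings).strip()
print(item_name)
```

Joining the pieces gives a plain string instead of a one-element list, which may be more convenient if the heading ever contains several separate text nodes.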

【Comments】:

【Solution 3】:

How about this:

from bs4 import BeautifulSoup
html= """<h1 class="it-ttl" id="itemTitle" itemprop="name"><span class="g-hdn">Details about   </span>item name goes here</h1>"""

soup = BeautifulSoup(html, "lxml")

text = soup.find('h1', attrs={"id":"itemTitle"}).text
span = soup.find('span', attrs={"class":"g-hdn"}).text

final_text = text[len(span):]

print(final_text)

This results in:

item name goes here
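
The slicing above assumes that the span exists and that its text is the leading part of the heading. A slightly more defensive sketch under the same idea follows; the function name title_without_prefix is hypothetical, and it uses the built-in 'html.parser' so that it runs without lxml.

```python
from bs4 import BeautifulSoup


def title_without_prefix(html):
    """Return the heading text with the hidden span's text sliced off the front."""
    soup = BeautifulSoup(html, "html.parser")
    h1 = soup.find("h1", attrs={"id": "itemTitle"})
    if h1 is None:
        return None  # no heading on the page

    text = h1.get_text()
    span = h1.find("span", attrs={"class": "g-hdn"})
    if span is not None:
        prefix = span.get_text()
        # Slice only when the span's text really is the leading part.
        if text.startswith(prefix):
            text = text[len(prefix):]
    return text.strip()


html = """<h1 class="it-ttl" id="itemTitle" itemprop="name"><span class="g-hdn">Details about   </span>item name goes here</h1>"""
print(title_without_prefix(html))
```

If the span is missing or is not the first thing in the heading, this version returns the heading text unchanged instead of cutting off the wrong characters.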

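【Comments】:

It works :) A simple solution that does the job. Thank you very much.
Thanks. I think there are other, better solutions, but if the span always comes first in the content you are scraping, this should be the simplest.

【Solution 4】:

Try this and see whether it works: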

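from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup("""<h1 class="it-ttl" id="itemTitle" itemprop="name">
<span class="g-hdn">Details about  </span>
item name goes here</h1>""", "html.parser")
# The last child of the <h1> is the text node that follows the <span>.
print(soup.find('h1', {'class': 'it-ttl'}).contents[-1].strip())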

【Comments】: