Extract text from an html file with BeautifulSoup/Python

【Question title】: Extract text from html file with BeautifulSoup/Python
【Posted】: 2019-06-20 17:50:22
【Question】:

I'm trying to extract text from an html file. The html file looks like this:

<li class="toclevel-1 tocsection-1">
    <a href="#Baden-Württemberg"><span class="tocnumber">1</span>
        <span class="toctext">Baden-Württemberg</span>
    </a>
</li>
<li class="toclevel-1 tocsection-2">
    <a href="#Bayern">
        <span class="tocnumber">2</span>
        <span class="toctext">Bayern</span>
    </a>
</li>
<li class="toclevel-1 tocsection-3">
    <a href="#Berlin">
        <span class="tocnumber">3</span>
        <span class="toctext">Berlin</span>
    </a>
</li>

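I want to extract the text from the last span tag of each item. In the first item, that is the text following class="toctext", i.e. "Baden-Württemberg". I then want to put these strings into a Python list.

In Python I tried the following: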

names = soup.find_all("span",{"class":"toctext"})

My output is a list:

[<span class="toctext">Baden-Württemberg</span>, <span class="toctext">Bayern</span>, <span class="toctext">Berlin</span>]

So how can I extract only the text between the tags?

Thanks, everyone

【Question comments】:

Tags: python html beautifulsoup

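【Solution 1】:

The find_all method returns a list of tag objects (a ResultSet, which behaves like a list). Iterate over it and read each tag's text attribute to get the text.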

for name in names:
    print(name.text)

Baden-Württemberg
Bayern
Berlin
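
Since the question asks for a Python list rather than printed output, the same loop can collect the strings instead. Below is a minimal, self-contained sketch. It assumes the HTML from the question is available as a string and is parsed with the standard library's html.parser backend; the names html_doc and state_names are illustrative and not from the original post.

```python
from bs4 import BeautifulSoup

# Shortened copy of the HTML from the question (assumed to be held in a string).
html_doc = """
<li class="toclevel-1 tocsection-1">
    <a href="#Baden-Württemberg"><span class="tocnumber">1</span>
        <span class="toctext">Baden-Württemberg</span>
    </a>
</li>
<li class="toclevel-1 tocsection-2">
    <a href="#Bayern">
        <span class="tocnumber">2</span>
        <span class="toctext">Bayern</span>
    </a>
</li>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Collect the text of every <span class="toctext"> into a plain list of strings.
state_names = []
for span in soup.find_all("span", {"class": "toctext"}):
    state_names.append(span.text)

print(state_names)
```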

Python's built-in dir() and type() functions are always handy for inspecting an object.

print(dir(names))

[...,
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'append',
 'clear',
 'copy',
 'count',
 'extend',
 'index',
 'insert',
 'pop',
 'remove',
 'reverse',
 'sort',
 'source']
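
The dir() listing above covers the container only. As a rough sketch of the same inspection idea, type() can be applied to both the result and one of its elements. The one-line HTML snippet and the variables is_list and has_text are made up for illustration.

```python
from bs4 import BeautifulSoup

# A tiny stand-in document, just enough to get a non-empty result.
soup = BeautifulSoup('<span class="toctext">Bayern</span>', "html.parser")
names = soup.find_all("span", {"class": "toctext"})

print(type(names))     # the container returned by find_all
print(type(names[0]))  # a single element inside the container

# The container is a list subclass, so indexing and iteration work as usual.
is_list = isinstance(names, list)

# Each element exposes a "text" attribute, which is what the loop above reads.
has_text = "text" in dir(names[0])

print(is_list, has_text)
```

This suggests why print(names) shows whole tags: the list holds tag objects, not strings, and the string has to be read from each element.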

【Comments】:

【Solution 2】:

Using a list comprehension, you can do the following:

names = soup.find_all("span",{"class":"toctext"})
print([x.text for x in names])
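
A possible variation on the comprehension, offered as a sketch rather than as part of the original answer: get_text(strip=True) trims surrounding whitespace, and select() picks the same elements with a CSS selector. The padded " Berlin " text and the names html_doc and names_css are assumptions added here to make the trimming visible.

```python
from bs4 import BeautifulSoup

# Illustrative snippet; the extra spaces around "Berlin" are added on purpose.
html_doc = """
<a href="#Berlin">
    <span class="tocnumber">3</span>
    <span class="toctext"> Berlin </span>
</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# get_text(strip=True) removes leading and trailing whitespace from the text.
names = [span.get_text(strip=True) for span in soup.find_all("span", class_="toctext")]

# The same elements, selected with a CSS selector instead of find_all.
names_css = [span.get_text(strip=True) for span in soup.select("span.toctext")]

print(names, names_css)
```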

【Comments】: