Web scraping a table with a hidden part using Python

【Question title】: Web scraping table with hidden part using python
【Posted】: 2017-02-28 10:01:47
【Question】:

I am trying to get the information from this table:

<table class="table4 table4-1 table4-1-1"><thead><tr><th class="estilo1">No</th><th class="estilo2">Si</th><!--                                                        <th><div class="contenedor-vinculos6"><a title="Ver más " class="vinculo-interrogacion" href="#">Más información</a></div></th>--></tr></thead><tbody><tr><td class="estilo1"><span class="estilo3">100%<span class="numero-voto">(15)</span></span><div class="grafica1 grafica1-desacuerdo"><div class="item-grafica" style="width: 100%;"/></div></div></td><td class="estilo2"><span class="estilo3">0%<span class="numero-voto">(0)</span></span><div class="grafica1 grafica1-deacuerdo"><div class="item-grafica" style="width: 0%;"/></div></div></td><td><span class="display-none">Más información</span></td></tr></tbody></table>

I do the following in Python 3:

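from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0'}  # assumed placeholder; the original post does not show this value
req = Request('http://www.congresovisible.org/votaciones/10918/', headers=headers)
web_page = urlopen(req)
soup = BeautifulSoup(web_page.read(), 'html.parser')
table = soup.find_all('table', attrs={'class': 'table4 table4-1 table4-1-1'})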

This works, but it only returns part of the table; it leaves out everything from this point on:

<td class="estilo2"><span class="estilo3...)

This is the output:

[<table class="table4 table4-1 table4-1-1"><thead><tr><th class="estilo1">No</th><th class="estilo2">Si</th><!--                                                        <th><div class="contenedor-vinculos6"><a title="Ver más " class="vinculo-interrogacion" href="#">Más información</a></div></th>--></tr></thead><tbody><tr><td class="estilo1"><span class="estilo3">100%<span class="numero-voto">(15)</span></span><div class="grafica1 grafica1-desacuerdo"><div class="item-grafica" style="width: 100%;"></div></div></td></tr></tbody></table>]

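How can I extract the whole table?

【Question comments】:

Tags: python web-scraping beautifulsoup html-table

【Solution 1】:

This is actually easy to fix. html.parser does not handle this malformed HTML well. Use the more lenient html5lib instead. This works for me: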

import requests
from bs4 import BeautifulSoup

response = requests.get('http://www.congresovisible.org/votaciones/10918/')
soup = BeautifulSoup(response.content, 'html5lib')
table = soup.find_all('table', attrs={'class':'table4 table4-1 table4-1-1'})
print(table)

Note that this requires the html5lib package to be installed:

pip install --upgrade html5lib

By the way, the lxml parser works as well (it also has to be installed):

soup = BeautifulSoup(response.content, 'lxml')
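
To see how the parser choice affects the result without hitting the live site, the comparison can be sketched on a reduced sample. The markup below is a hypothetical stand-in for the page, not a copy of it: it keeps the self-closing `<div ... />` followed by two closing tags from the question, and the `outer` wrapper div and the `count_cells` helper are assumptions made for illustration. A parser that reports fewer cells than the others is truncating the table in the way the question describes.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Reduced, hypothetical stand-in for the page markup: a non-void <div ... />
# written in self-closing form, followed by two closing </div> tags.
SAMPLE = (
    '<div id="outer"><table class="table4"><tbody><tr>'
    '<td class="estilo1"><div class="grafica1">'
    '<div class="item-grafica" style="width: 100%;"/></div></div></td>'
    '<td class="estilo2">0%</td>'
    '</tr></tbody></table></div>'
)


def count_cells(markup, parsers=("html.parser", "html5lib", "lxml")):
    """Return {parser name: number of <td> cells found inside the table}.

    A parser that is not installed is reported as None instead of raising.
    """
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # The parser library is not available in this environment.
            results[name] = None
            continue
        table = soup.find("table", class_="table4")
        results[name] = len(table.find_all("td")) if table else 0
    return results


if __name__ == "__main__":
    for parser, cells in count_cells(SAMPLE).items():
        print(parser, cells)
```

Catching `FeatureNotFound` keeps the comparison usable when only some of the parsers are installed; the sample has two cells, so any parser reporting fewer has dropped part of the row.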

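【Comments】:

I get this error: "Couldn't find a tree builder with the features you requested: html5lib. Do you need to install a parser library?" I get the same error with lxml. Do you know what the cause might be?
@user2246905 That is exactly why I added the note to the answer: whichever one you go with, you need to install html5lib or lxml. Hope that helps.
I have both installed. import html5lib raises no error, but it seems something is not installed properly.
@user2246905 Make sure you installed them into the same Python environment that runs the script.
I checked, and they are in the right environment. html5lib.__version__ shows '0.999'. Do you think the latest version is not being installed, as in stackoverflow.com/questions/39086278/…?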
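
The environment check suggested in the comments can be sketched with the standard library alone. This is a minimal sketch, and `parser_status` is a hypothetical helper name, not part of Beautiful Soup. It only confirms which interpreter is running and whether the parser modules are importable from it, so it may not explain a case like the one above where the import succeeds and the error still appears.

```python
import importlib.util
import sys


def parser_status(modules=("html5lib", "lxml")):
    """Map each module name to True if this interpreter can import it."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    # The interpreter actually running the script; pip must install into this one.
    print("interpreter:", sys.executable)
    for name, available in parser_status().items():
        print(name, "importable" if available else "NOT importable")
```

If a module shows up as not importable, running pip through the same interpreter (python -m pip install html5lib) installs it into the environment the script uses.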