Beautiful Soup: fetching dynamic table data

【Question title】: Beautiful Soup fetch dynamic table data
【Posted】: 2018-02-02 16:11:58
【Question description】:

I have the following code:

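from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.basketball-reference.com/leagues/NBA_2017_standings.html#all_expanded_standings'
html = urlopen(url)  # downloads the raw HTML only; no JavaScript is run
soup = BeautifulSoup(html, 'lxml')

print(len(soup.findAll('table')))
print(soup.findAll('table'))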

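The page shows 6 tables, but this returns only 4. I tried 'html.parser' and 'html5lib' as the parser, but that did not help either.

Any idea how to get the "Expanded Standings" table from the page?

Thanks!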

【Question discussion】:

The rest is loaded by JS.
What do you mean? Do you know how I can access it?
You can use selenium to access the rest.

Tags: python parsing web-scraping beautifulsoup lxml

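【Solution 1】:

requests cannot fetch data that is loaded by JS, because it only downloads the raw HTML and never runs the page's scripts. So you have to use selenium. First install selenium with pip (pip install selenium), then download the Chrome driver and put the file in your working directory. Then try the code below.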

from bs4 import BeautifulSoup
import time
from selenium import webdriver

url = "https://www.basketball-reference.com/leagues/NBA_2017_standings.html"
browser = webdriver.Chrome()  # starts Chrome through the Chrome driver

try:
    browser.get(url)
    time.sleep(3)  # crude fixed wait so the page's JS can finish
    html = browser.page_source  # the DOM as the browser rendered it
    soup = BeautifulSoup(html, "lxml")

    print(len(soup.find_all("table")))
    print(soup.find("table", {"id": "expanded_standings"}))
finally:
    browser.quit()  # closes every window and ends the driver process

See the selenium documentation.
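
The time.sleep(3) in the code above is a guess at how long the page needs. As a rough sketch of an alternative, the helper below polls the page source until the wanted table appears or a timeout expires. The names wait_for_table and get_html and the list of fake page snapshots are hypothetical, made up for illustration. Selenium also ships its own WebDriverWait for this purpose, which is not what is shown here.

```python
import time
from bs4 import BeautifulSoup


def wait_for_table(get_html, table_id, timeout=10.0, interval=0.5):
    """Poll get_html() until a <table id=table_id> shows up, or time out."""
    deadline = time.monotonic() + timeout
    while True:
        soup = BeautifulSoup(get_html(), "html.parser")
        table = soup.find("table", {"id": table_id})
        if table is not None:
            return table
        if time.monotonic() >= deadline:
            raise TimeoutError(f"table {table_id!r} not found after {timeout}s")
        time.sleep(interval)


# Hypothetical stand-in for a browser: the table only exists on the third read.
snapshots = iter([
    "<html><body></body></html>",
    "<html><body><p>loading</p></body></html>",
    '<html><body><table id="expanded_standings"></table></body></html>',
])
table = wait_for_table(lambda: next(snapshots), "expanded_standings",
                       timeout=5, interval=0.01)
print(table.get("id"))
```

With a real Selenium session, the first argument would be something like lambda: browser.page_source, so each poll re-reads the rendered DOM.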

If you are on Linux and get the error "Chromedriver executable needs to be in the PATH", try the following: link-1, link-2
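
A browser may not always be needed. Some sites send the "missing" tables in the raw HTML wrapped in HTML comments, and the page's JS only uncomments them. Whether this page does so is an assumption that the thread does not confirm. Under that assumption, a minimal sketch with Beautiful Soup alone could re-parse the comment nodes. SAMPLE_HTML and find_tables_including_comments are hypothetical names used only for this illustration.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for a downloaded page: one table is in the live
# markup, the other is wrapped in an HTML comment.
SAMPLE_HTML = """
<html><body>
<table id="visible_table"><tr><td>1</td></tr></table>
<div id="all_expanded_standings">
<!--
<table id="expanded_standings"><tr><td>Team</td><td>Overall</td></tr></table>
-->
</div>
</body></html>
"""


def find_tables_including_comments(html):
    """Return every <table>, including those hidden inside HTML comments."""
    soup = BeautifulSoup(html, "html.parser")
    tables = list(soup.find_all("table"))
    # Comment nodes are plain strings; re-parse each one that mentions a table.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        if "<table" in comment:
            hidden = BeautifulSoup(comment, "html.parser")
            tables.extend(hidden.find_all("table"))
    return tables


tables = find_tables_including_comments(SAMPLE_HTML)
print(len(tables))
print([t.get("id") for t in tables])
```

If the table count from a function like this matches what the browser shows, the plain urlopen download from the question already contains the data and only the parsing step needs to change.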

【Discussion】: