Getting Table Info From Page Using Python and BeautifulSoup

【Question title】: Getting Table Info From Page Using Python and BeautifulSoup
【Posted】: 2020-07-26 08:50:56
【Question description】:

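The page I'm trying to get information from is https://www.pro-football-reference.com/teams/crd/2017_roster.htm.

I'm trying to get all the information from the "Roster" table, but for some reason I can't get it through BeautifulSoup. I tried soup.find("div", {'id': 'div_games_played_team'}), but it doesn't work. When I look at the page's HTML, I can see the table inside a very large comment and also in a regular div. How can I use BeautifulSoup to get the information from that table?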

【Comments】:

Check that

Tags: python html beautifulsoup

【Solution 1】:

You don't need Selenium. What you can do (and you identified this correctly) is pull out the comments, then parse the table from inside them.

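from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup, Comment


url = 'https://www.pro-football-reference.com/teams/crd/2017_roster.htm'
response = requests.get(url)

soup = BeautifulSoup(response.text, 'html.parser')
# The tables sit inside HTML comments, so collect every comment node first.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

tables = []
for each in comments:
    if 'table' in each:
        try:
            # Wrap in StringIO: newer pandas versions deprecate literal HTML strings.
            tables.append(pd.read_html(StringIO(str(each)))[0])
        except ValueError as e:
            # Raised when the comment holds no parseable <table>.
            print(e)
            continue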

Output:

print(tables[0].head().to_string())
    No.           Player   Age  Pos   G   GS     Wt    Ht  College/Univ  BirthDate   Yrs   AV                            Drafted (tm/rnd/yr)      Salary
0  54.0  Bryson Albright  23.0  NaN   7  0.0  245.0   6-5    Miami (OH)  3/15/1994     1  0.0                                            NaN    $246,177
1  36.0    Budda Baker*+  21.0   ss  16  7.0  195.0  5-10    Washington  1/10/1996  Rook  9.0     Arizona Cardinals / 2nd / 36th pick / 2017    $465,000
2  64.0    Khalif Barnes  35.0  NaN   3  0.0  320.0   6-6    Washington  4/21/1982    12  0.0  Jacksonville Jaguars / 2nd / 52nd pick / 2005    $176,471
3  41.0   Antoine Bethea  33.0   db  15  6.0  206.0  5-11        Howard  7/27/1984    11  4.0   Indianapolis Colts / 6th / 207th pick / 2006  $2,000,000
4  28.0    Justin Bethel  27.0  rcb  16  6.0  200.0   6-0  Presbyterian  6/17/1990     5  3.0    Arizona Cardinals / 6th / 177th pick / 2012  $2,000,000
....
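
If only the roster table is wanted rather than every commented-out table, the same idea can be narrowed to a single table id. The following is a minimal sketch, not part of the original answer: the helper name `table_rows_from_comments` is hypothetical, the id `games_played_team` is an assumption inferred from the `div_games_played_team` id in the question, and `sample_html` is a made-up stand-in so the example runs without a network request.

```python
from bs4 import BeautifulSoup, Comment


def table_rows_from_comments(html, table_id):
    """Return rows of the commented-out table with the given id as dicts."""
    soup = BeautifulSoup(html, "html.parser")
    # Comments are opaque strings to the parser, so each candidate is re-parsed.
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        if table_id not in comment:
            continue
        table = BeautifulSoup(comment, "html.parser").find("table", id=table_id)
        if table is None:
            continue
        # Treat the first row as the header row (an assumption about the layout).
        headers = [th.get_text(strip=True) for th in table.find("tr").find_all("th")]
        rows = []
        for tr in table.find_all("tr")[1:]:
            cells = [c.get_text(strip=True) for c in tr.find_all(["th", "td"])]
            # Skip rows whose shape does not match the header.
            if len(cells) == len(headers):
                rows.append(dict(zip(headers, cells)))
        return rows
    return []


# Hypothetical stand-in for the real page, which hides its table in a comment.
sample_html = """
<div id="all_games_played_team">
<!--
<div id="div_games_played_team">
<table id="games_played_team">
<tr><th>No.</th><th>Player</th><th>Pos</th></tr>
<tr><th>36</th><td>Budda Baker</td><td>ss</td></tr>
<tr><th>41</th><td>Antoine Bethea</td><td>db</td></tr>
</table>
</div>
-->
</div>
"""

rows = table_rows_from_comments(sample_html, "games_played_team")
print(rows)
```

For the live page, `response.text` would be passed in place of `sample_html`. The real table may have extra header rows or a different id, so the header handling here should be checked against the actual markup.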

【Discussion】:

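【Solution 2】:

The tag you are trying to scrape is generated dynamically by JavaScript. You are most likely using requests to fetch your HTML. Unfortunately, requests does not run JavaScript, because it retrieves the HTML as raw text. BeautifulSoup cannot find the tag because it was never generated in your scraper.

I suggest using Selenium. It is not a perfect solution, just the best one for your problem. Selenium WebDriver will execute the JavaScript to generate the page's HTML. You can then use BeautifulSoup to parse whatever you want. See Selenium with Python for more help on getting started.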
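
A rough sketch of the Selenium route described above might look like the following. It is an illustration under assumptions, not the answerer's code: the function names are hypothetical, the table id `games_played_team` is inferred from the question's div id, and it assumes a local Chrome installation recent enough to accept the `--headless=new` argument.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url):
    """Load the page in headless Chrome and return the HTML after scripts ran."""
    # Imported here so find_table() below works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")  # assumes a recent Chrome build
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()


def find_table(html, table_id):
    """Return the <table> tag with the given id, or None if it is absent."""
    return BeautifulSoup(html, "html.parser").find("table", id=table_id)


def main():
    url = "https://www.pro-football-reference.com/teams/crd/2017_roster.htm"
    html = fetch_rendered_html(url)
    # "games_played_team" is an assumed id, inferred from the question's div id.
    table = find_table(html, "games_played_team")
    print("found" if table is not None else "not found")
```

Calling `main()` needs the selenium package and a browser, so it starts far more slowly than a plain requests call. Whether the table shows up in the rendered DOM depends on the page's scripts having run by the time `page_source` is read.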

【Discussion】: