【Question Title】: Trouble Scraping a Table with Python BeautifulSoup
【Posted】: 2020-01-10 22:05:17
【Question】:

I am trying to scrape the table data from this site: https://www.playnj.com/atlantic-city/revenue/

However, when I try to print the table, it returns None. Can anyone help?

Here is my code:

from bs4 import BeautifulSoup
import requests
import pandas as pd
base_url = 'https://www.playnj.com/atlantic-city/revenue/'
resp = requests.get(base_url)
soup = BeautifulSoup(resp.text, "html.parser")
october_table = soup.find('table', {'id': 'tablepress-342-no-2'})
print(october_table)

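This returns None and I am not sure why. Ideally (and maybe I am wrong here), if my goal is to get all the data from all of the tables, it would be more efficient to use the class wrapper that all of the tables share, so I would use the following 2 lines instead (but maybe not).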

all_tables = soup.findAll('table', {'class': 'dataTables_wrapper no-footer'})
print(all_tables)

However, this also returns None. Any help here would be greatly appreciated.
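
The two lookups in the question can be tried against a small HTML string to see how they behave when nothing matches. This is only a sketch: the HTML below is made up and is not a copy of the real page. It assumes that the `dataTables_wrapper` class belongs to a wrapper `div` that the DataTables script adds in the browser, which is how that library normally works, so the class would not appear on a `table` tag in the HTML that `requests` downloads. Note that `find` returns None on a miss, while `find_all` (or `findAll`) returns an empty list.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for what the server might send (an assumption,
# not a copy of the real page): one table with id "tablepress-342".
SAMPLE_HTML = """
<div>
  <table id="tablepress-342" class="tablepress">
    <tr><th>Casino</th><th>Total Gaming Win</th></tr>
    <tr><td>Borgata</td><td>$59,045,940</td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() gives None when no tag matches the id.
missing = soup.find("table", {"id": "tablepress-342-no-2"})
print(missing)

# find_all() gives an empty list (not None) when no tag matches.
wrappers = soup.find_all("table", {"class": "dataTables_wrapper no-footer"})
print(wrappers)

# The id that is actually present in the sample is found.
present = soup.find("table", {"id": "tablepress-342"})
print(present is not None)
```

If the real page behaves like this sample, the ids and classes seen in the browser's inspector are not a reliable guide, and it is safer to look at the downloaded HTML itself.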

【Comments】:

  • You are getting a 403 Forbidden response. You should try sending the required headers.
  • @Sers Something like headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'}; resp = requests.get(url, headers=headers)?
  • Yes, start by sending a header with a User-Agent. You can also print resp.text to check that you are not getting a bot warning.
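
The advice in these comments can be folded into a small helper that sends a User-Agent and stops on an error status instead of parsing an error page. This is a minimal sketch: `fetch_html` and its `get` parameter are hypothetical names introduced here, and the 10-second timeout is an arbitrary choice.

```python
import requests

# A short User-Agent value; whether a given site accepts it is an assumption.
HEADERS = {"User-Agent": "Mozilla/5.0"}


def fetch_html(url, get=requests.get):
    """Download url with a User-Agent header and fail loudly on 4xx/5xx.

    `get` is injectable (a hypothetical convenience) so the function can be
    exercised without touching the network.
    """
    resp = get(url, headers=HEADERS, timeout=10)
    # Raises requests.HTTPError for responses such as 403 Forbidden.
    resp.raise_for_status()
    return resp.text
```

Calling `fetch_html('https://www.playnj.com/atlantic-city/revenue/')` would then either return the page text or raise an exception that names the status code.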

Tags: python web-scraping beautifulsoup python-requests


【Solution 1】:
import pandas as pd
import requests

headers = {"User-Agent": "Mozilla/5.0"}

df = pd.read_html(requests.get(
    "https://www.playnj.com/atlantic-city/revenue/", headers=headers).text)[0]

df.to_csv("out.csv", index=False)

Output:

          Casino Table & Other       Poker Slot Machines Total Gaming Win
0        Bally's    $3,441,617    $183,255    $9,780,559      $13,405,431
1        Borgata   $16,744,564  $1,631,575   $40,669,801      $59,045,940
2        Caesars   $13,785,260         $ -   $14,530,482      $28,315,742
3  Golden Nugget    $5,237,258     $92,647   $11,728,116      $17,058,021
4      Hard Rock    $7,155,391         $ -   $16,338,090      $23,493,481
5       Harrah's    $5,555,330    $222,323   $19,794,846      $25,572,499
6   Ocean Resort    $4,965,900     $82,686   $14,459,903      $19,508,489
7        Resorts    $3,328,916         $ -   $10,566,342      $13,895,258
8      Tropicana    $4,531,234    $159,957   $18,957,670      $23,648,861
9          Total   $64,745,470  $2,372,443  $156,825,809     $223,943,722

CSV file: view-online
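
The values in this output are strings such as "$3,441,617", and "$ -" appears where there is no figure. If numbers are needed, one possible follow-up is a small converter. This is a sketch only: `parse_dollars` is a hypothetical helper, and treating "$ -" as None rather than 0 is an assumption about what the dash means.

```python
def parse_dollars(value):
    """Turn a string like "$3,441,617" into an int; return None for "$ -"."""
    cleaned = value.replace("$", "").replace(",", "").strip()
    if cleaned in ("", "-"):
        return None
    return int(cleaned)


# Bally's row copied from the output above.
ballys = ["$3,441,617", "$183,255", "$9,780,559", "$13,405,431"]
table_other, poker, slots, total = (parse_dollars(v) for v in ballys)

# The three categories should add up to the reported total.
print(table_other + poker + slots == total)

# A dash placeholder has no numeric value.
print(parse_dollars("$ -"))
```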

【Comments】:

    【Solution 2】:

    It seems this page checks the User-Agent header.

    Even an incomplete value such as "User-Agent": "Mozilla/5.0" works.

    By the way, this table has a different ID: 'id': 'tablepress-342'


    import requests
    from bs4 import BeautifulSoup
    
    url = 'https://www.playnj.com/atlantic-city/revenue/'
    r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    print(r.status_code)
    
    soup = BeautifulSoup(r.text, "html.parser")
    
    october_table = soup.find('table', {'id': 'tablepress-342'})
    #print(october_table)
    for row in october_table.find_all('tr'):
        for item in row.find_all('td'):
            print(item.text)
        print('---')
    

    Result:

    200
    ---
    Bally's
    $3,799,907 
    $180,229 
    $9,107,610 
    $13,087,746 
    ---
    Borgata
    $14,709,145 
    $1,060,246 
    $35,731,777 
     $51,501,168 
    ---
    Caesars
    $7,097,502 
    $ -
    $14,689,045 
    $21,786,547 
    ---
    Golden Nugget
    $3,311,223 
    $84,387 
    $11,356,285 
    $14,751,895 
    ---
    Hard Rock
    $7,849,617 
    $ -
    $16,619,183 
    $24,468,800 
    ---
    Harrah's
    $4,507,262 
    $205,921 
    $19,372,672 
    $24,085,855 
    ---
    Ocean Resort
    $5,116,397 
    $65,276 
    $13,245,998 
    $18,427,671 
    ---
    Resorts
    $2,257,149 
    $ -
    $9,859,813 
    $12,116,962 
    ---
    Tropicana
    $4,377,139 
    $152,876 
    $17,501,139 
    $22,031,154 
    ---
    Total
    $53,025,341 
    $1,748,935 
    $147,483,522 
    $202,257,798 
    ---
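
The first `---` in the result has no values before it. That is probably the header row: header cells are usually `th` rather than `td`, so the inner loop prints nothing for it. The sketch below assumes that layout and collects each row into a dict keyed by the header text instead of printing cell by cell. The HTML is made up, and `table_to_records` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Made-up sample with a th header row, as assumed in the text above.
SAMPLE_HTML = """
<table id="tablepress-342">
  <thead><tr><th>Casino</th><th>Poker</th></tr></thead>
  <tbody>
    <tr><td>Bally's</td><td>$180,229 </td></tr>
    <tr><td>Caesars</td><td>$ -</td></tr>
  </tbody>
</table>
"""


def table_to_records(table):
    """Return one dict per data row, keyed by the header cell text."""
    rows = table.find_all("tr")
    # Header cells may be th or td, so accept both.
    header = [c.get_text(strip=True) for c in rows[0].find_all(["th", "td"])]
    records = []
    for row in rows[1:]:
        cells = [c.get_text(strip=True) for c in row.find_all("td")]
        if cells:  # skip rows that have no td cells
            records.append(dict(zip(header, cells)))
    return records


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", {"id": "tablepress-342"})
# Guard against find() returning None before touching the table.
records = table_to_records(table) if table is not None else []
for record in records:
    print(record)
```

Using `get_text(strip=True)` also removes the trailing spaces that show up after some of the amounts in the printed result.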
    

    【Comments】:

      【Solution 3】:

      Request with headers:

      import requests
      from bs4 import BeautifulSoup
      
      headers = {
          'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:72.0) Gecko/20100101 Firefox/72.0',
          'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
          'Accept-Language': 'ru-RU,ru;q=0.8,en;q=0.6,en-US;q=0.4,tr;q=0.2',
          'DNT': '1',
          'Connection': 'keep-alive',
          'Upgrade-Insecure-Requests': '1',
      }
      
      resp = requests.get('https://www.playnj.com/atlantic-city/revenue/', headers=headers)
      soup = BeautifulSoup(resp.text, "html.parser")
      tables = soup.select('table.tablepress')
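
The snippet stops once the tables are selected. A possible next step is to list the `id` of every matched table, which shows which ids the server really sends. This is a sketch against made-up HTML: apart from `tablepress-342`, the ids and class names below are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML: two tables carrying the "tablepress" class and one without.
SAMPLE_HTML = """
<table id="tablepress-342" class="tablepress tablepress-id-342"></table>
<table id="tablepress-341" class="tablepress tablepress-id-341"></table>
<table id="layout" class="other"></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The CSS selector matches any table whose class list includes "tablepress".
tables = soup.select("table.tablepress")

# Collect the ids in document order; .get() returns None if a table has no id.
table_ids = [t.get("id") for t in tables]
print(table_ids)
```

Unlike passing the full class string to `find_all`, the selector `table.tablepress` matches on a single class, so extra classes on the tag do not prevent a match.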
      

      【Comments】:
