【Question title】: Why is my web-scraping returning nothing?
【Posted】: 2019-06-28 05:26:30
【Question】:

I am trying to scrape a table from a public site using Python. I checked that the script connects to the site by running page_soup.p, and it returned the page's first "p" tag.

When I check that my tag selection works by running containers[0], I get:

Traceback (most recent call last):
  File "", line 1, in <module>
IndexError: list index out of range

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://overwatchleague.com/en-us/stats'

# open the connection and grab the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

# html parsing
page_soup = soup(page_html, "html.parser")

# grabs each player
containers = page_soup.findAll("tr",{"class":"Table-row"})

That selector should match about 183 rows, so getting 0 is obviously not what I expected. Any idea what I am doing wrong?

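【Comments】:

  • Some JavaScript library renders those rows with that class in the browser, after the page has loaded. View the page source (even in the browser) and you will see the rows are not there, so BeautifulSoup cannot find them.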
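The point in the comment above can be checked without a browser: count the matching rows in the HTML exactly as it is served. Below is a minimal sketch using only the standard library. The two HTML snippets are hypothetical stand-ins (not the real page), and RowCounter and count_rows are names made up for this illustration.

```python
from html.parser import HTMLParser


class RowCounter(HTMLParser):
    """Counts <tr> start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "tr":
            return
        # The class attribute may be missing or empty; treat both as no classes.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_rows(html, class_name="Table-row"):
    """Return how many <tr class="..."> elements carry class_name in html."""
    parser = RowCounter(class_name)
    parser.feed(html)
    return parser.count


# Hypothetical markup: what a server might send before any script has run.
served_html = """
<html><body>
  <p>Stats</p>
  <table><tbody id="stats-body"></tbody></table>
  <script src="app.js"></script>
</body></html>
"""

# Hypothetical markup: the same table after a script has filled it in.
rendered_html = """
<table><tbody id="stats-body">
  <tr class="Table-row"><td>Ado</td></tr>
  <tr class="Table-row"><td>Adora</td></tr>
</tbody></table>
"""

print(count_rows(served_html))    # 0
print(count_rows(rendered_html))  # 2
```

If the count is 0 on the downloaded HTML while the rows are visible in the browser, the rows are being added by a script, and an index such as containers[0] will raise IndexError.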

Tags: python web-scraping containers


【Solution 1】:

The data is loaded via JSON. To find the correct URL, look at the network requests the page makes, for example in the Firefox developer tools:

import requests
from datetime import timedelta

url = 'https://api.overwatchleague.com/stats/players?stage_id=regular_season&season=2019'

data = requests.get(url).json()

print('{:^12}{:^12}{:^12}{:^20}'.format('Name', 'Team', 'Deaths', 'Time Played'))
print('-' * (12*3+20))
for row in data['data']:
    print('{:^12}'.format(row['name']), end='')
    print('{:^12}'.format(row['team']), end='')
    print('{:^12.2f}'.format(row['deaths_avg_per_10m']), end='')
    t = timedelta(seconds=float(row['time_played_total']))
    print('{:>20}'.format(str(t)))

Prints:

    Name        Team       Deaths       Time Played     
--------------------------------------------------------
    Ado         WAS         5.47         15:23:08.217194
   Adora        HZS         3.72          9:08:57.586787
 Agilities      VAL         5.27         17:16:59.668653
    Aid         TOR         5.08          8:02:19.102897
   AimGod       BOS         4.69         17:04:31.769137
    aKm         DAL         4.64         16:57:14.261245
   alemao       BOS         4.99          2:36:25.171021
   ameng        CDH         6.24         16:06:12.084212
   Anamo        NYE         2.36         17:33:31.143450
 Architect      SFS         4.33          3:18:45.065564
   ArHaN        HOU         6.39          1:54:10.439213
    ArK         WAS         2.50          9:32:57.421203

...and so on.
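For anyone who prefers to avoid the extra dependency, the same idea can be sketched with the standard library alone, with the download separated from the formatting so that the formatting can be tried offline. This is only a sketch under assumptions: fetch_json and format_rows are hypothetical helper names, the sample payload copies the field names used in the code above, and its values are approximated from the first printed row (the time is converted back to seconds). Whether the endpoint still responds the way it did in 2019 is not something this sketch can confirm.

```python
import json
from datetime import timedelta
from urllib.request import urlopen


def fetch_json(url, timeout=10):
    """Download a URL and decode the response body as JSON."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def format_rows(payload):
    """Turn the decoded payload into aligned text lines, one per player."""
    lines = []
    # A payload without a "data" key yields no lines instead of a KeyError.
    for row in payload.get("data", []):
        played = timedelta(seconds=float(row["time_played_total"]))
        lines.append("{:^12}{:^12}{:^12.2f}{:>20}".format(
            row["name"], row["team"], row["deaths_avg_per_10m"], str(played)))
    return lines


# Sample payload with the same field names as above; values are approximations
# taken from the first printed row, not a real API response.
sample = {"data": [
    {"name": "Ado", "team": "WAS", "deaths_avg_per_10m": 5.47,
     "time_played_total": 55388.217194},
]}

for line in format_rows(sample):
    print(line)

# Live call, left commented out because the endpoint may have changed:
# url = 'https://api.overwatchleague.com/stats/players?stage_id=regular_season&season=2019'
# for line in format_rows(fetch_json(url)):
#     print(line)
```

Keeping format_rows free of network access makes it easy to check the layout against a small hand-written payload before pointing the script at the live URL.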
