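【Question title】: How to web scrape live data into Google Sheets
【Posted】: 2021-01-23 06:21:35
【Question】:

I am trying to build an NHL betting system and need to scrape live data from the web every day. The IMPORTHTML() function does not work with the table I need to scrape. I am trying to use Python, but I have not found a good tutorial for beginners. I need help.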

>>> from bs4 import BeautifulSoup
>>> import requests
>>> from selenium import webdriver
>>> import pandas as pd
>>> PATH = "C:/webdrivers/chromedriver.exe"
>>> table_name = "table_container"
>>> csv_name = 'nhl_season_stats.csv'
>>> URL = "https://www.hockey-reference.com/leagues/NHL_2021.html"
>>> def get_nhl_stats(URL):
...     driver = webdriver.Chrome(PATH)
...     driver.get(URL)
...     soup = BeautifulSoup(driver.page_source,'html')
...     driver.quit()
...     tables = soup.find_all('table',{"id":[table_name]})
...     for table in tables:
...             tab_name = table['id']
...             tab_data = [[cell.text for cell in row.find_all(["th","td"])]
...                                     for row in table.find_all("tr")]
...             df = pd.DataFrame(tab_data)
...             df.columns = df.iloc[0,:]
...             df.drop(index=0,inplace= True)
...             df.to_csv(csv_name, index = False)
...             print(tab_name)
...             print(df)
...
>>> get_nhl_stats(URL)

I keep getting this error:

DevTools listening on ws://127.0.0.1:59353/devtools/browser/2ad39b85-94a0-4f64-a738-994c69f7373c
[10572:2256:0123/020420.281:ERROR:device_event_log_impl.cc(211)] [02:04:20.281] USB: usb_device_handle_win.cc:1049 Failed to read descriptor from node connection: A device attached to the system is not functioning. (0x1F)
[10572:2256:0123/020420.283:ERROR:device_event_log_impl.cc(211)] [02:04:20.283] USB: usb_device_handle_win.cc:1049 Failed to read descriptor from node connection: A device attached to the system is not functioning. (0x1F)

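【Comments】:

  • Please provide the code you have already tried.
  • @goalie1998 OK, done.
  • @Mason Just curious, but why use Selenium? You could simply 1) use requests and beautifulsoup to get that data; or 2) use pandas; or 3) use the API on nhl.com. All of these options are faster than simulating a browser session and then parsing the data.
  • @goalie1998 I got the script from someone on YouTube. I don't really know what I'm doing, I was just trying to copy him.

Tags: python web-scraping google-sheets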


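【Solution 1】:

What happens in your code?

You try to get all tables with the ID table_container, but that cannot work, because table_container only appears on the page as a class name, not as an ID.

How to fix it?

It is not clear from your question which table you want to grab, but I assume it is stats, so change the value of your variable before the loop: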

table_name = "stats"
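
To see why the original lookup came back empty, here is a minimal sketch on a small inline snippet. The markup is a hypothetical stand-in for the real page (the div_stats ID and the single data row are assumptions made for illustration); it only shows how BeautifulSoup treats id and class lookups differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the layout described above: the wrapper has
# the class "table_container", while the table itself has the id "stats".
html = """
<div class="table_container" id="div_stats">
  <table id="stats">
    <tr><th>Rk</th><th>Team</th></tr>
    <tr><td>1</td><td>Montreal Canadiens</td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Looking for a table whose id is "table_container" finds nothing.
by_wrong_id = soup.find_all("table", id="table_container")

# "table_container" is a class, so a class lookup finds the wrapper element.
by_class = soup.find_all(class_="table_container")

# Looking for the table by its actual id finds it.
by_right_id = soup.find_all("table", id="stats")

print(len(by_wrong_id), len(by_class), len(by_right_id))
```

Because the loop in the question iterates over an empty list, nothing is printed and no CSV is written, even though no exception is raised.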

About your error

Take a look at this answer: Failed to read descriptor from node connection: A device attached to the system is not functioning error using ChromeDriver Chrome through Selenium

Example

from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd

PATH = "C:/webdrivers/chromedriver.exe"  # path to your local chromedriver
table_name = "stats"
csv_name = 'nhl_season_stats.csv'
URL = "https://www.hockey-reference.com/leagues/NHL_2021.html"

def get_nhl_stats(URL):
    # Selenium 3 style call; Selenium 4 passes the path via a Service object instead
    driver = webdriver.Chrome(PATH)
    driver.get(URL)
    soup = BeautifulSoup(driver.page_source, 'html')
    driver.quit()
    # Select the tables by their id attribute
    tables = soup.find_all('table', id=table_name)

    for table in tables:
        tab_name = table['id']
        # One list of cell texts per table row
        tab_data = [[cell.text for cell in row.find_all(["th", "td"])]
                    for row in table.find_all("tr")]
        df = pd.DataFrame(tab_data)
        # Promote the first row to column names, then drop it from the data
        df.columns = df.iloc[0, :]
        df.drop(index=0, inplace=True)
        df.to_csv(csv_name, index=False)
        print(tab_name)
        print(df)

get_nhl_stats(URL)

Output

0                                                       Special Teams  \
1   Rk                         AvAge  GP  W  L  OL  PTS          PTS%   
2    1     Montreal Canadiens   28.6   5  3  0   2    8          .800   
3    2   Vegas Golden Knights   29.0   4  4  0   0    8         1.000   
4    3    Philadelphia Flyers   27.0   5  3  1   1    7          .700   
5    4          Winnipeg Jets   27.9   4  3  1   0    6          .750   
6    5     New York Islanders   28.9   4  3  1   0    6          .750   
7    6    Toronto Maple Leafs   29.0   5  3  2   0    6          .600   
8    7    Tampa Bay Lightning   27.7   3  3  0   0    6         1.000   

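【Comments】:

  • Let me try it!
  • Careful, I had put my own chromedriver path in the example; I will change it.
  • It works! Now I just need to figure out how to have it automatically refresh the data into Google Sheets.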
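
Neither answer covers the Google Sheets step raised in the last comment. One possible approach, sketched here under assumptions, is to upload the CSV written by the script with the third-party gspread package. The key file name, the spreadsheet title, and the helper function names are all hypothetical; it also assumes a Google service account that has been given edit access to the spreadsheet.

```python
import csv


def rows_from_csv(csv_path):
    """Read the CSV written by the scraper into a list of rows (header first)."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return [row for row in csv.reader(handle)]


def open_worksheet(key_file, spreadsheet_title):
    """Open the first worksheet of a spreadsheet through a service account.

    Assumption: gspread is installed and key_file is a service-account JSON
    key whose account can edit the spreadsheet.
    """
    import gspread  # third-party; imported lazily so the CSV helper works without it

    client = gspread.service_account(filename=key_file)
    return client.open(spreadsheet_title).sheet1


def push_rows(worksheet, rows):
    """Replace the worksheet contents with the given rows."""
    worksheet.clear()
    worksheet.append_rows(rows)
```

A daily run would call push_rows(open_worksheet("service_account.json", "NHL stats"), rows_from_csv("nhl_season_stats.csv")) after the scraper finishes. The scheduling itself (for example Windows Task Scheduler or cron) is a separate step that this sketch does not handle.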
【Solution 2】:

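I'm not sure whether the sports-reference sites are "live", but they are up to date. You can let pandas do most of the work of parsing the tables for you. I suspect you were using Selenium because those tables do not show up in the html returned by a simple requests call. But the tables are actually inside the comments of the html. You just need to pull them out: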

from io import StringIO

import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd

URL = 'https://www.hockey-reference.com/leagues/NHL_2021.html'
def get_nhl_stats(URL):
    # Send a browser-like user agent with the request
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'}

    pageTree = requests.get(URL, headers=headers)
    pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
    # Collect every HTML comment in the page
    comments = pageSoup.find_all(string=lambda text: isinstance(text, Comment))

    tables = []
    for each in comments:
        if 'table' in str(each):
            try:
                # Parse the commented-out markup; header=1 takes the second row as column names
                tables.append(pd.read_html(StringIO(str(each)), header=1)[0])
            except Exception:
                # The comment mentions "table" but holds no parsable table
                continue

    df = tables[0]
    df = df.rename(columns={'Unnamed: 1':'Team'})
    print(df)

get_nhl_stats(URL)

Output:

print(df.to_string())
      Rk                   Team  AvAge  GP  W  L  OL  PTS   PTS%  GF  GA  SOW  SOL   SRS   SOS  TG/G  EVGF  EVGA  PP  PPO    PP%  PPA  PPOA     PK%  SH  SHA  PIM/G  oPIM/G    S    S%   SA    SV%  SO
0    1.0    Toronto Maple Leafs   29.0   6  4  2   0    8  0.667  19  17  0.0  0.0  0.33 -0.01  6.00    11    12   8   18  44.44    4    22   81.82   0    1   10.5     7.5  190  10.0  157  0.892   0
1    2.0     Montreal Canadiens   28.6   5  3  0   2    8  0.800  24  15  0.0  1.0  0.77 -0.83  7.80    14     8   6   20  30.00    6    25   76.00   4    1   11.4    10.6  180  13.3  140  0.893   0
2    3.0   Vegas Golden Knights   28.9   5  4  1   0    8  0.800  18  12  0.0  0.0  1.12 -0.08  6.00    15     8   2   18  11.11    3    18   83.33   1    1    7.2     7.2  150  12.0  125  0.904   0
3    4.0         Minnesota Wild   29.1   5  4  1   0    8  0.800  15  10  0.0  0.0  0.86 -0.14  5.00    13     9   1   23   4.35    1    16   93.75   1    0    7.6    10.4  166   9.0  147  0.932   0
4    5.0    Washington Capitals   30.1   5  3  0   2    8  0.800  18  16  1.0  1.0  0.10 -0.30  6.80    16    12   2    9  22.22    3    18   83.33   0    1    8.6     5.0  130  13.8  141  0.887   0
5    6.0    Philadelphia Flyers   27.0   5  3  1   1    7  0.700  19  15  0.0  1.0  0.36 -0.24  6.80    14    10   5   17  29.41    5    18   72.22   0    0    7.2     6.8  125  15.2  187  0.920   1
6    7.0     Colorado Avalanche   26.9   5  3  2   0    6  0.600  17  12  0.0  0.0  0.47 -0.53  5.80     7     9  10   25  40.00    3    19   84.21   0    0    8.0    10.4  147  11.6  143  0.916   1
7    8.0          Winnipeg Jets   27.9   4  3  1   0    6  0.750  13  10  0.0  0.0  1.10  0.35  5.75    11     6   2   20  10.00    4    12   66.67   0    0   10.3    14.3  119  10.9  134  0.925   0
8    9.0     New York Islanders   28.9   4  3  1   0    6  0.750   9   6  0.0  0.0  0.61 -0.14  3.75     5     5   4   20  20.00    1    15   93.33   0    0   11.5    11.0  108   8.3  114  0.947   2
9   10.0    Tampa Bay Lightning   27.7   3  3  0   0    6  1.000  13   5  0.0  0.0  1.70 -0.97  6.00    11     2   2    8  25.00    3    11   72.73   0    0    9.0     7.0  107  12.1   85  0.941   0
10  11.0    Pittsburgh Penguins   28.6   5  3  2   0    6  0.600  16  21  2.0  0.0 -0.43  0.17  7.40    10    16   5   18  27.78    5    19   73.68   1    0    7.6     7.2  152  10.5  130  0.838   0
11  12.0      New Jersey Devils   26.2   4  2  1   1    5  0.625   9  10  0.0  1.0 -0.35  0.15  4.75     8     3   1   11   9.09    6    16   62.50   0    1    9.8     7.3  112   8.0  150  0.933   0
12  13.0        St. Louis Blues   28.3   4  2  1   1    5  0.625  10  14  0.0  1.0 -1.66 -0.41  6.00    10     6   0   14   0.00    8    21   61.90   0    0   11.0     7.5  109   9.2  129  0.891   0
13  14.0          Boston Bruins   28.8   4  2  1   1    5  0.625   7   9  2.0  0.0  0.07  0.07  4.00     3     7   3   13  23.08    2    18   88.89   1    0   11.3     8.8  135   5.2   96  0.906   0
14  15.0        Arizona Coyotes   28.4   5  2  2   1    5  0.500  17  17  0.0  1.0 -0.04  0.16  6.80    11    11   5   22  22.73    5    24   79.17   1    1   10.4     9.6  144  11.8  157  0.892   0
15  16.0         Calgary Flames   28.1   3  2  0   1    5  0.833  11   6  0.0  0.0  1.14 -0.52  5.67     5     4   6   16  37.50    1    12   91.67   0    1    8.7    11.3   93  11.8   93  0.935   1
16  17.0        Edmonton Oilers   27.9   6  2  4   0    4  0.333  15  20  0.0  0.0 -0.91 -0.08  5.83    10    14   3   23  13.04    4    18   77.78   2    2    7.7     9.3  192   7.8  200  0.900   0
17  18.0      Vancouver Canucks   27.3   6  2  4   0    4  0.333  17  28  1.0  0.0 -1.34  0.33  7.50    12    17   4   26  15.38    9    31   70.97   1    2   13.3    10.7  179   9.5  222  0.874   0
18  19.0          Anaheim Ducks   28.6   5  1  2   2    4  0.400   8  13  0.0  0.0 -0.10  0.90  4.20     8    10   0   12   0.00    2    15   86.67   0    1    6.4     5.2  133   6.0  160  0.919   1
19  20.0  Columbus Blue Jackets   26.6   5  1  2   2    4  0.400  10  16  0.0  0.0 -1.19  0.01  5.20     9    15   1   11   9.09    1    10   90.00   0    0    9.0     9.4  152   6.6  169  0.905   0
20  21.0      Los Angeles Kings   28.3   4  1  1   2    4  0.500  12  13  0.0  0.0  0.43  0.68  6.25     8    10   4   17  23.53    3    21   85.71   0    0   11.0     9.0  119  10.1  121  0.893   0
21  22.0      Detroit Red Wings   29.3   5  2  3   0    4  0.400  10  14  0.0  0.0 -1.54 -0.74  4.80     9     9   1   12   8.33    4    16   75.00   0    1   11.4     9.8  130   7.7  155  0.910   0
22  23.0        San Jose Sharks   29.4   5  2  3   0    4  0.400  12  18  2.0  0.0 -1.32 -0.52  6.00     7    16   5   21  23.81    2    18   88.89   0    0    8.4     9.6  162   7.4  148  0.878   0
23  24.0    Carolina Hurricanes   27.0   3  2  1   0    4  0.667   9   6  0.0  0.0  0.26 -0.74  5.00     6     5   3   12  25.00    1     9   88.89   0    0    7.7     9.7   98   9.2   68  0.912   1
24  25.0       Florida Panthers   27.8   2  2  0   0    4  1.000  10   6  0.0  0.0  1.29 -0.71  8.00     7     3   3    8  37.50    3     5   40.00   0    0    5.0     8.0   66  15.2   66  0.909   0
25  26.0    Nashville Predators   28.7   4  2  2   0    4  0.500  10  14  0.0  0.0  0.01  1.01  6.00     9     7   1   16   6.25    6    16   62.50   0    1    8.0     8.0  135   7.4  126  0.889   0
26  27.0         Buffalo Sabres   27.2   5  1  3   1    3  0.300  14  15  0.0  1.0 -0.18  0.22  5.80    11    14   3   17  17.65    1     6   83.33   0    0    3.8     8.2  161   8.7  133  0.887   0
27  28.0       New York Rangers   25.6   4  1  2   1    3  0.375  11  11  0.0  1.0 -0.15  0.11  5.50     7     7   4   21  19.05    4    16   75.00   0    0    8.5    14.0  140   7.9  112  0.902   1
28  29.0     Chicago Blackhawks   26.9   5  1  3   1    3  0.300  13  21  0.0  0.0 -0.43  1.17  6.80     5    16   7   17  41.18    5    20   75.00   1    0    8.0     6.8  154   8.4  167  0.874   0
29  30.0        Ottawa Senators   27.0   4  1  2   1    3  0.375  11  14  0.0  0.0 -0.04  0.71  6.25     8    10   3   18  16.67    4    21   80.95   0    0   14.3    15.3  113   9.7  120  0.883   0
30  31.0           Dallas Stars   28.8   1  1  0   0    2  1.000   7   0  0.0  0.0  7.30  0.30  7.00     1     0   5    8  62.50    0     5  100.00   1    0   10.0    16.0   28  25.0   34  1.000   1
31   NaN         League Average   28.0   4  2  2   1    5  0.574  13  13  NaN  NaN   NaN   NaN  5.94     9     9   4   16  21.33    4    16   78.67   0    0    8.0     8.0  133   9.8  133  0.902   0
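
The comment trick can be seen in isolation on a small inline document. This is a minimal sketch: the markup is a hypothetical stand-in for the real page, and it re-parses the comment text with BeautifulSoup instead of pandas so that it runs without a network connection.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical markup: the table is wrapped in an HTML comment, so a parser
# sees it as comment text rather than as elements.
html = """
<div id="all_stats">
<!--
<table id="stats">
  <tr><th>Rk</th><th>Team</th></tr>
  <tr><td>1</td><td>Toronto Maple Leafs</td></tr>
</table>
-->
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# The commented-out table is not part of the parsed element tree.
visible_tables = soup.find_all("table")

# Collect the comment nodes, then parse each comment's text as HTML.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

hidden_rows = []
for comment in comments:
    inner = BeautifulSoup(str(comment), "html.parser")
    for row in inner.find_all("tr"):
        cells = row.find_all(["th", "td"])
        hidden_rows.append([cell.get_text(strip=True) for cell in cells])

print(len(visible_tables))
print(hidden_rows)
```

The first print shows that a plain table search finds nothing, while the second shows the rows recovered from the comment. This is the same reason a simple requests call appears to return a page without the tables.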

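【Comments】:

  • Comments are not for extended discussion; this conversation has been moved to chat.
  • @Mason, after looking into it, I think it was the beautifulsoup version. It now appears that when the comments are pulled out, they are beautifulsoup objects rather than strings. So only two lines in that code needed fixing. The code above has been updated.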