【Title】: Grab table from football recruiting website
【Posted】: 2021-07-26 23:40:15
【Question】:

I would like to create a table exactly like the one shown on the following webpage: https://247sports.com/college/penn-state/Season/2022-Football/Commits/

I'm currently implementing this in a Google Colab notebook using Selenium and Beautiful Soup, because I ran into a Forbidden error when running the `read_html` command. I'm just starting to get some output, but I only want to grab the text, not the markup surrounding it.

Here is my code so far...

from kora.selenium import wd
from bs4 import BeautifulSoup
import pandas as pd
import time
import datetime as dt
import os
import re
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

url = 'https://247sports.com/college/penn-state/Season/2022-Football/Commits/'
wd.get(url)
time.sleep(5)

soup = BeautifulSoup(wd.page_source, "html.parser")

school=soup.find_all('span', class_='meta')    
name=soup.find_all('div', class_='recruit')
position = soup.find_all('div', class_="position")
height_weight = soup.find_all('div', class_="metrics")
rating = soup.find_all('span', class_='score')
nat_rank = soup.find_all('a', class_='natrank')
state_rank = soup.find_all('a', class_='sttrank')
pos_rank = soup.find_all('a', class_='posrank')
status = soup.find_all('p', class_='commit-date withDate')

status

...and here is my output...

[<p class="commit-date withDate"> Commit 7/25/2020  </p>,
 <p class="commit-date withDate"> Commit 9/4/2020  </p>,
 <p class="commit-date withDate"> Commit 1/1/2021  </p>,
 <p class="commit-date withDate"> Commit 3/8/2021  </p>,
 <p class="commit-date withDate"> Commit 10/29/2020  </p>,
 <p class="commit-date withDate"> Commit 7/28/2020  </p>,
 <p class="commit-date withDate"> Commit 9/8/2020  </p>,
 <p class="commit-date withDate"> Commit 8/3/2020  </p>,
 <p class="commit-date withDate"> Commit 5/1/2021  </p>]
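The tags above still carry their surrounding markup; to keep only the text, one can call `.get_text(strip=True)` on each element. A minimal sketch using a couple of the `<p>` tags from the output shown (the HTML snippet here is just sample data, not a live request):

```python
from bs4 import BeautifulSoup

# Sample markup shaped like the elements returned by find_all above
html = """
<p class="commit-date withDate"> Commit 7/25/2020  </p>
<p class="commit-date withDate"> Commit 9/4/2020  </p>
"""

soup = BeautifulSoup(html, "html.parser")

# get_text(strip=True) drops the tag and trims surrounding whitespace
statuses = [
    p.get_text(strip=True)
    for p in soup.find_all("p", class_="commit-date withDate")
]
print(statuses)  # ['Commit 7/25/2020', 'Commit 9/4/2020']
```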

Any help with this would be greatly appreciated.

【Comments】:

    Tags: python python-3.x selenium beautifulsoup google-colaboratory


    【Solution 1】:

    There's no need to use Selenium. To get a response from the site you need to set the HTTP `User-Agent` header; otherwise the site assumes you are a bot and blocks you.

    To create the DataFrame, see this example:

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    
    
    url = "https://247sports.com/college/penn-state/Season/2022-Football/Commits/"
    # Add the `user-agent` otherwise we will get blocked when sending the request
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
    }
    
    
    response = requests.get(url, headers=headers).content
    soup = BeautifulSoup(response, "html.parser")
    data = []
    
    for tag in soup.find_all("li", class_="ri-page__list-item")[1:]:  # `[1:]` Since the first result is a table header
        school = tag.find_next("span", class_="meta").text
        name = tag.find_next("a", class_="ri-page__name-link").text
        position = tag.find_next("div", class_="position").text
        height_weight = tag.find_next("div", class_="metrics").text
        rating = tag.find_next("span", class_="score").text
        nat_rank = tag.find_next("a", class_="natrank").text
        state_rank = tag.find_next("a", class_="sttrank").text
        pos_rank = tag.find_next("a", class_="posrank").text
        status = tag.find_next("p", class_="commit-date withDate").text
    
        data.append(
            {
                "school": school,
                "name": name,
                "position": position,
                "height_weight": height_weight,
                "rating": rating,
                "nat_rank": nat_rank,
                "state_rank": state_rank,
                "pos_rank": pos_rank,
                "status": status,
            }
        )
    
    df = pd.DataFrame(data)
    
    print(df.to_string())
    

    Output:

                                                        school            name position height_weight  rating nat_rank state_rank pos_rank                status
    0                  Westerville South (Westerville, OH)      Kaden Saunders      WR    5-10 / 172   0.9509      116          5       16    Commit 7/25/2020  
    1                          IMG Academy (Bradenton, FL)        Drew Shelton      OT     6-5 / 290   0.9468      130         17       14     Commit 9/4/2020  
    2                Central Dauphin East (Harrisburg, PA)       Mehki Flowers      WR     6-1 / 190   0.9461      131          4       18     Commit 1/1/2021  
    3                                  Medina (Medina, OH)          Drew Allar     PRO     6-5 / 220   0.9435      138          6        8     Commit 3/8/2021  
    4                     Manheim Township (Lancaster, PA)        Anthony Ivey      WR     6-0 / 190   0.9249      190          6       26   Commit 10/29/2020  
    5                                 King (Milwaukee, WI)         Jerry Cross      TE     6-6 / 218   0.9153      218          4        8    Commit 7/28/2020  
    6                         Northeast (Philadelphia, PA)          Ken Talley     WDE     6-3 / 230   0.9069      253          9       13     Commit 9/8/2020  
    7                              Central York (York, PA)        Beau Pribula    DUAL     6-2 / 215   0.8891      370         12        9     Commit 8/3/2020  
    8   The Williston Northampton School (Easthampton, MA)       Maleek McNeil      OT     6-8 / 340   0.8593      705          8       64     Commit 5/1/2021  
    
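    If the `status` column is needed as real dates rather than strings, one can strip the "Commit" prefix and parse the remainder with `pandas.to_datetime`. A minimal sketch, using a few sample rows shaped like the output above rather than the live scrape:

    ```python
    import pandas as pd

    # Sample rows shaped like the scraped `status` column above
    df = pd.DataFrame(
        {"status": ["Commit 7/25/2020", "Commit 9/4/2020", "Commit 1/1/2021"]}
    )

    # Drop the "Commit" prefix, trim whitespace, then parse as M/D/YYYY
    df["commit_date"] = pd.to_datetime(
        df["status"].str.replace("Commit", "", regex=False).str.strip(),
        format="%m/%d/%Y",
    )
    print(df["commit_date"].tolist())
    ```

    With proper datetimes the commits can then be sorted or grouped by date in the usual pandas ways.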

    【Discussion】:

    • What is the main difference between this approach and Selenium? As I understand it, Selenium is the best way to handle dynamically loaded elements, and bs is faster.
    • @vitaliis Correct. However, in this case the page is not dynamically loaded; it seems the OP used Selenium because they were being blocked. That's why we add the `User-Agent`, so the request doesn't get blocked.
    • Selenium can also set a `User-Agent`. But I agree this approach is best in this case.
    • @vitaliis Yes, my point was that the OP assumed the page was dynamically loaded because they got no response when sending the request.
    • Worked for me! Thank you so much for all your help.