Scraping Data from Website obscuring Data python: answers

【Question title】: Scraping Data from Website obscuring Data python
【Posted】: 2020-12-22 21:38:17
【Question description】:

I am trying to scrape individual at-bat data from individual URLs. Here is an example: (https://baseballsavant.mlb.com/savant-player/willson-contreras-575929?stats=gamelogs-r-hitting-statcast&season=2020)

The data seems to be hidden, or at least I cannot get at it with this code:

import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome('/Users/gru/Documents/chromedriver')
driver.get('https://baseballsavant.mlb.com/savant-player/willson-contreras-575929?stats=gamelogs-r-hitting-statcast&season=2020')
html_page = driver.page_source
time.sleep(15)
soup = BeautifulSoup(html_page, 'lxml')
for j in soup.find_all('tr'):
    drounders=[]
    for h in j.find_all('td'):
        drounders.append(h.get_text())
    print(drounders)

These are the first few rows I expect to get:

Game Date   Bat Team    Fld Team    Pitcher Result  EV (MPH)    LA (°)  Dist (ft)   Direction   Pitch (MPH) Pitch Type  
1   2020-08-12          Carrasco, Carlos    strikeout                           
2   2020-08-12          Carrasco, Carlos    strikeout                           
3   2020-08-12          Carrasco, Carlos    force_out               Opposite            
4   2020-08-11          Allen, Logan    force_out   107.8   -25 5   Pull    94.0    4-Seam Fastball 
5   2020-08-11          Allen, Logan    strikeout                   77.3    Curveball   
6   2020-08-11          Hill, Cam   sac_fly 100.5   42  345 Straightaway    91.6    4-Seam Fastball

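【Comments on the question】:

You should take a look at scrapy. It automates a lot of things and makes web scraping much easier.

Tags: python selenium web-scraping

【Solution 1】:

The only problem I see here is the Bat Team column, because that column contains an image instead of text. In this answer I scrape the link of the image from the Bat Team column and append it at the end of each row. If you want to ignore it, remove img from the for loop.

Code: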

from selenium import webdriver
from bs4 import BeautifulSoup
import time


site = 'https://baseballsavant.mlb.com/savant-player/willson-contreras-575929?stats=gamelogs-r-hitting-statcast&season=2020'
finalData = []
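# Here I am using Chrome's web driver.
# Note: executable_path was removed in Selenium 4.10; newer versions take a Service object instead.
driver = webdriver.Chrome(executable_path = 'chromedriver.exe')
# For the Firefox web driver:
# driver = webdriver.Firefox(executable_path = 'geckodriver.exe')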
driver.get(site)
time.sleep(10)  # give the JavaScript-rendered table time to load before reading page_source
soup = BeautifulSoup(driver.page_source, 'html.parser')
table = soup.find("div", id = "gamelogs_statcast")
trs = table.find_all("tr")
for trValue in trs:
    txt = str(trValue.text)         # text of all cells in the row, concatenated
    img = str(trValue.find("img"))  # Bat Team logo tag; 'None' if the row has no image
    finalData.append(txt + img)

print(finalData)

Output:

['Game DateBat TeamFld TeamPitcherResultEV (MPH)LA (°)Dist (ft)DirectionPitch (MPH)Pitch TypeNone', '1 2020-08-13   Burnes, Corbin field_out 104.1 24 400 Straightaway 95.7 4-Seam Fastball <img class="table-team-logo" src="https://www.mlbstatic.com/team-logos/112.svg"/>', '2 2020-08-13   Burnes, Corbin walk     89.2 Slider <img class="table-team-logo" src="https://www.mlbstatic.com/team-logos/112.svg"/>', '3 2020-08-13   Anderson, Brett hit_by_pitch     89.5 4-Seam Fastball <img class="table-team-logo" src="https://www.mlbstatic.com/team-logos/112.svg"/>' ........]
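The output above joins every cell of a row into one string, while the question asks for separate columns. Below is a minimal sketch of one way to keep the cells apart and to turn the logo image into a team ID. It is not the code from this answer: it uses the standard library's html.parser instead of BeautifulSoup so that it runs without a browser, and RowParser, parse_rows and the sample markup are hypothetical (only the img tag is copied from the output above; the real table layout may differ).

```python
from html.parser import HTMLParser
import re


class RowParser(HTMLParser):
    """Collect each <tr> as a list of cell strings (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
        elif tag == "img" and self._cell is not None:
            # A logo cell has no text, so keep the team ID from the image URL
            # (or the whole URL if it does not match the assumed pattern).
            src = dict(attrs).get("src") or ""
            match = re.search(r"/team-logos/(\d+)\.svg", src)
            self._cell.append(match.group(1) if match else src)

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_rows(html):
    """Return every table row found in html as a list of cell strings."""
    parser = RowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample markup; only the <img> tag comes from the output above.
sample = """
<table>
  <tr><th>Game Date</th><th>Bat Team</th><th>Pitcher</th><th>Result</th></tr>
  <tr><td>2020-08-13</td>
      <td><img class="table-team-logo" src="https://www.mlbstatic.com/team-logos/112.svg"/></td>
      <td>Burnes, Corbin</td><td>walk</td></tr>
</table>
"""
rows = parse_rows(sample)
print(rows)
```

With Selenium, the same function could presumably be fed driver.page_source once the table has rendered. Note that it collects every tr on the page, not only the rows inside the gamelogs_statcast div, so some filtering may still be needed.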

Hope this helps. Let me know if you need any further help with this answer.

【Discussion】: