【Question Title】: Scrape Tables on Multiple Pages with a Single URL
【Posted】: 2020-07-31 06:45:44
【Problem Description】:

I am trying to scrape data from Fangraphs. The table is split across 21 pages, but every page uses the same URL. I am very new to web scraping (and to Python in general), but Fangraphs has no public API, so scraping the page seems to be my only option. I am currently using BeautifulSoup to parse the HTML, and I can scrape the initial table, but it only contains the first 30 players and I want the entire player pool. After two days of searching the web I am stuck. The link and my current code are below. I know the site has a link to download a CSV file, but doing that by hand all season would get tedious, and I would like to speed up the data-collection process. Any direction would be appreciated, thank you.

https://www.fangraphs.com/projections.aspx?pos=all&stats=bat&type=fangraphsdc

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.fangraphs.com/projections.aspx?pos=all&stats=bat&type=fangraphsdc&team=0&lg=all&players=0'

response = requests.get(url, verify=False)

# Use BeautifulSoup to parse the HTML code
soup = BeautifulSoup(response.content, 'html.parser')

# find the projections grid; the id is the page's ASP.NET grid control
stat_table = soup.find_all('table', {'id': 'ProjectionBoard1_dg1_ctl00'})

# changes stat_table from ResultSet to a Tag
stat_table = stat_table[0]

# Convert html table to list
rows = []
for tr in stat_table.find_all('tr')[1:]:
    cells = []
    tds = tr.find_all('td')
    if len(tds) == 0:
        ths = tr.find_all('th')
        for th in ths:
            cells.append(th.text.strip())
    else:
        for td in tds:
            cells.append(td.text.strip())
    rows.append(cells)

# convert table to df
table = pd.DataFrame(rows)

【Question Discussion】:

    Tags: python url web-scraping beautifulsoup


    【Solution 1】:
    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    
    # query-string parameters for the projections page
    params = {
        "pos": "all",
        "stats": "bat",
        "type": "fangraphsdc"
    }
    
    # ASP.NET postback payload: __EVENTTARGET/__EVENTARGUMENT fire the grid's
    # PageSize command with a value of 1000, so every player comes back in a
    # single response instead of 30 per page
    data = {
        'RadScriptManager1_TSM': 'ProjectionBoard1$dg1',
        "__EVENTTARGET": "ProjectionBoard1$dg1",
        '__EVENTARGUMENT': 'FireCommand:ProjectionBoard1$dg1$ctl00;PageSize;1000',
        '__VIEWSTATEGENERATOR': 'C239D6F0',
        '__SCROLLPOSITIONX': '0',
        '__SCROLLPOSITIONY': '1366',
        "ProjectionBoard1_tsStats_ClientState": "{\"selectedIndexes\":[\"0\"],\"logEntries\":[],\"scrollState\":{}}",
        "ProjectionBoard1_tsPosition_ClientState": "{\"selectedIndexes\":[\"0\"],\"logEntries\":[],\"scrollState\":{}}",
        "ProjectionBoard1$rcbTeam": "All+Teams",
        "ProjectionBoard1_rcbTeam_ClientState": "",
        "ProjectionBoard1$rcbLeague": "All",
        "ProjectionBoard1_rcbLeague_ClientState": "",
        "ProjectionBoard1_tsProj_ClientState": "{\"selectedIndexes\":[\"5\"],\"logEntries\":[],\"scrollState\":{}}",
        "ProjectionBoard1_tsUpdate_ClientState": "{\"selectedIndexes\":[],\"logEntries\":[],\"scrollState\":{}}",
        "ProjectionBoard1$dg1$ctl00$ctl02$ctl00$PageSizeComboBox": "30",
        "ProjectionBoard1_dg1_ctl00_ctl02_ctl00_PageSizeComboBox_ClientState": "",
        "ProjectionBoard1$dg1$ctl00$ctl03$ctl01$PageSizeComboBox": "1000",
        "ProjectionBoard1_dg1_ctl00_ctl03_ctl01_PageSizeComboBox_ClientState": "{\"logEntries\":[],\"value\":\"1000\",\"text\":\"1000\",\"enabled\":true,\"checkedIndices\":[],\"checkedItemsTextOverflows\":false}",
        "ProjectionBoard1_dg1_ClientState": ""
    }
    
    
    def main(url):
        with requests.Session() as req:
            # initial GET to load the page and read its hidden form fields
            r = req.get(url, params=params)
            soup = BeautifulSoup(r.content, 'html.parser')
            # __VIEWSTATE and __EVENTVALIDATION are generated per page load,
            # so they must be pulled from the GET response before posting back
            data['__VIEWSTATE'] = soup.find("input", id="__VIEWSTATE").get("value")
            data['__EVENTVALIDATION'] = soup.find(
                "input", id="__EVENTVALIDATION").get("value")
            # replay the form post and parse the full grid straight into pandas
            r = req.post(url, params=params, data=data)
            df = pd.read_html(r.content, attrs={
                              'id': 'ProjectionBoard1_dg1_ctl00'})[0]
            df.drop(df.columns[1], axis=1, inplace=True)
            print(df)
            df.to_csv("data.csv", index=False)
    
    
    main("https://www.fangraphs.com/projections.aspx")
    

    Output: view-online
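    For context on why this works: the page is an ASP.NET WebForms grid, so the pager never changes the URL; it posts the form back with hidden state fields, and the __EVENTARGUMENT value above asks the grid to re-render with a page size of 1000. To rebuild a payload like this for another page on the site, one option is to dump the page's hidden inputs and compare them with what the browser sends in its POST. A minimal sketch, not part of the original answer (dump_hidden_fields is a hypothetical helper name):

    import requests
    from bs4 import BeautifulSoup

    # Hypothetical helper: list every hidden <input> on an ASP.NET page so the
    # postback payload can be reconstructed for other pages on the site.
    def dump_hidden_fields(url, params=None):
        r = requests.get(url, params=params)
        soup = BeautifulSoup(r.content, "html.parser")
        for inp in soup.find_all("input", type="hidden"):
            value = inp.get("value", "")
            print(inp.get("name"), "=", value[:60] + ("..." if len(value) > 60 else ""))

    dump_hidden_fields("https://www.fangraphs.com/projections.aspx",
                       params={"pos": "all", "stats": "bat", "type": "fangraphsdc"})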

    【Discussion】:

    • Thanks. On the first try I still only got the first 30 players, but the next time I ran it I got everything, so I'm sure it's resolved. Now I just need to figure out what this code is doing so I can pull data from other pages on this site (a possible starting point is sketched after these comments). Thanks again for your help.
    • @dpeters555 You copied it while I was still editing. It should work 100% now.
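
    A possible starting point for the other projection pages, as an untested sketch: "stats": "pit" is assumed from the site's query-string scheme, and the ClientState fields in the data payload above may also need to change to match the pitchers tab.

    # reuse the answer's params/data/main() for the pitcher projections
    params["stats"] = "pit"   # batters use "bat"; pitchers are assumed to use "pit"
    main("https://www.fangraphs.com/projections.aspx")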