【Question title】: Getting link and scraping table from list
【Posted】: 2020-01-24 14:48:17
【Question】:

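I want to get several tables for different players. When the script searches for someone like Sergio Rodriguez, multiple names come up (https://basketball.realgm.com/search?q=Sergio+Rodriguez), so it does not land on an individual player page and prints "No international table for Sergio Rodriguez." instead. Of the three results, I want to go to the individual page of the second one, the Sergio Rodriguez who played in the NBA, and scrape the tables, but I don't know how to do it. How can I use rel, since that seems to be the only feasible way? The pseudocode is there if it helps. Thanks.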

HTML:

<tbody>
<tr>
<td class="nowrap tablesaw-cell-persist" rel="Rodriguez Febles, Sergio"><a href="/player/Sergio-Rodriguez-Febles/Summary/50443">Sergio Rodriguez Febles</a></td>
<td class="nowrap" rel="5">SF</td>
<td class="nowrap" rel="79">6-7</td>
<td class="nowrap" rel="202">202</td>
<td class="nowrap" rel="19931018"><a href="/info/birthdays/19931018/1">Oct 18, 1993</a></td>
<td class="nowrap" rel="2015"><a href="/nba/draft/past_drafts/2015" target="_blank">2015</a></td>
<td class="nowrap" rel="N/A">-</td>
<td rel="-">-</td>
</tr>
<tr>
<td class="nowrap tablesaw-cell-persist" rel="Rodriguez, Sergio"><a href="/player/Sergio-Rodriguez/Summary/85">Sergio Rodriguez</a></td>
<td class="nowrap" rel="1">PG</td>
<td class="nowrap" rel="75">6-3</td>
<td class="nowrap" rel="176">176</td>
<td class="nowrap" rel="19860612"><a href="/info/birthdays/19860612/1">Jun 12, 1986</a></td>
<td class="nowrap" rel="2006"><a href="/nba/draft/past_drafts/2006" target="_blank">2006</a></td>
<td class="nowrap" rel="N/A">-</td>
<td rel="NYK, PHL, POR, SAC"><a href="/nba/teams/New-York-Knicks/20/Rosters/Regular/2010">NYK</a>, <a href="/nba/teams/Philadelphia-Sixers/22/Rosters/Regular/2017">PHL</a>, <a href="/nba/teams/Portland-Trail-Blazers/24/Rosters/Regular/2009">POR</a>, <a href="/nba/teams/Sacramento-Kings/25/Rosters/Regular/2010">SAC</a></td>
</tr>
<tr>
<td class="nowrap tablesaw-cell-persist" rel="Rodriguez, Sergio"><a href="/player/Sergio-Rodriguez/Summary/39601">Sergio Rodriguez</a></td>
<td class="nowrap" rel="3">SG</td>
<td class="nowrap" rel="76">6-4</td>
<td class="nowrap" rel="-">-</td>
<td class="nowrap" rel="19771012"><a href="/info/birthdays/19771012/1">Oct 12, 1977</a></td>
<td class="nowrap" rel="1999"><a href="/nba/draft/past_drafts/1999" target="_blank">1999</a></td>
<td class="nowrap" rel="N/A">-</td>
<td rel="-">-</td>
</tr>
</tbody>

Python (with pseudocode):

import requests
from bs4 import BeautifulSoup
import pandas as pd


playernames=['Carlos Delfino', 'Sergio Rodriguez']

result = pd.DataFrame()
for name in playernames:

    fname=name.split(" ")[0]
    lname=name.split(" ")[1]
    url="https://basketball.realgm.com/search?q={}+{}".format(fname,lname)
    response = requests.get(url)

    soup = BeautifulSoup(response.content, 'html.parser')

    # check the response url
    if (response.url == "https://basketball.realgm.com/search..."):
        # parse the search results, finding players who played in NBA
        ... get urls from the table ...
        soup.table...  # etc.
        foreach url in table:
            response = requests.get(player_url)
            soup = BeautifulSoup(response.content, 'html.parser')
            # call the parse function for a player page
            ...
            parse_player(soup)
    else: # we have a player page
        # call the parse function for a player page, same as above
        ...
        parse_player(soup)

    try:
        table1 = soup.find('h2',text='International Regular Season Stats - Per Game').findNext('table')
        table2 = soup.find('h2',text='International Regular Season Stats - Advanced Stats').findNext('table')

        df1 = pd.read_html(str(table1))[0]
        df2 = pd.read_html(str(table2))[0]

        commonCols = list(set(df1.columns) & set(df2.columns))
        df = df1.merge(df2, how='left', on=commonCols)
        df['Player'] = name

    except:
        print ('No international table for %s.' %name)
        df = pd.DataFrame([name], columns=['Player'])

【Question comments】:

    Tags: python html web-scraping beautifulsoup


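    【Solution 1】:

    Pandas has a very useful method that reads HTML directly. It is especially handy when you want to pull information out of tables, which is your case. Essentially, pandas.read_html finds every table on the page and returns each one as a DataFrame. Read more about it in the pandas read_html documentation.

    The catch is that you also need the players' links, and read_html reads the table as text and ignores the tags.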
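
One hedged aside on that limitation: assuming pandas 1.5 or later (and an installed HTML parser such as lxml), `read_html` accepts an `extract_links` argument that keeps the `href` next to each cell's text. The sketch below runs on a hand-trimmed, three-column copy of the search table rather than the live page; the `HTML` string and the variable names are illustrative only.

```python
from io import StringIO

import pandas as pd

# Hand-trimmed stand-in for the search results table (assumption: the real
# page has more columns; only Player, Pos and NBA are kept here).
HTML = """
<table>
  <thead><tr><th>Player</th><th>Pos</th><th>NBA</th></tr></thead>
  <tbody>
    <tr><td><a href="/player/Sergio-Rodriguez-Febles/Summary/50443">Sergio Rodriguez Febles</a></td><td>SF</td><td>-</td></tr>
    <tr><td><a href="/player/Sergio-Rodriguez/Summary/85">Sergio Rodriguez</a></td><td>PG</td><td><a href="/nba/teams/New-York-Knicks/20/Rosters/Regular/2010">NYK</a></td></tr>
    <tr><td><a href="/player/Sergio-Rodriguez/Summary/39601">Sergio Rodriguez</a></td><td>SG</td><td>-</td></tr>
  </tbody>
</table>
"""

# With extract_links="body", every body cell becomes a (text, href) tuple;
# href is None when the cell contains no link.
table = pd.read_html(StringIO(HTML), extract_links="body")[0]

# Keep the rows whose NBA text is not '-', then take the href of the Player cell.
nba_text = table["NBA"].map(lambda cell: cell[0])
played_in_nba = table[nba_text != "-"]
player_hrefs = played_in_nba["Player"].map(lambda cell: cell[1]).tolist()

print(player_hrefs)
```

On older pandas versions the option is not available, which is where a manual lookup of the link comes in.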

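    I did find one possible workaround, though. It is by no means the best, but hopefully you can use it and improve on it.

    The approach is:

    1. Read the table with read_html.
    2. From the table, pick the name of the player you need (the one whose NBA column is not '-').
    3. Several players can share the same name. Say there are 3 players named Sergio Rodriguez but only the second one played in the NBA: you need that position, i.e. index 1 (counting from 0), to look up the link later.
    4. To get that index, filter the table by the player's name and take the position of the NBA player among the matching rows.
    5. Then find all links whose text is Sergio Rodriguez.
    6. Keep only the link at the matching index, e.g. if the index is 1 (counting from 0), take the second link whose text == Sergio Rodriguez.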
    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    
    # read every table on the page as a list of DataFrames
    web_data = pd.read_html('https://basketball.realgm.com/search?q=Sergio+Rodriguez')
    
    # the table you need is the second to last one
    required_table = web_data[-2]
    
    print(required_table)
    
    >>>
                        Player Pos   HT   WT    Birth Date  Draft Year College                 NBA
    0  Sergio Rodriguez Febles  SF  6-7  202  Oct 18, 1993        2015       -                   -
    1         Sergio Rodriguez  PG  6-3  176  Jun 12, 1986        2006       -  NYK, PHL, POR, SAC
    2         Sergio Rodriguez  SG  6-4    -  Oct 12, 1977        1999       -                   -
    
    # get the name of the player who has played in the NBA
    required_player_name = required_table.loc[required_table['NBA'] != '-', 'Player'].iloc[0]
    
    print(required_player_name)
    
    >>>
    Sergio Rodriguez
    
    # keep only the rows with exactly this name; resetting the index numbers
    # the same-named players 0, 1, 2, ... in page order
    table_with_player = required_table.loc[required_table['Player'] == required_player_name].reset_index(drop=True)
    
    # position, among the same-named players, of the one whose NBA column is not '-'
    index_of_player_to_get = table_with_player[table_with_player['NBA'] != '-'].index[0]
    
    # e.g. if index_of_player_to_get == 2, we need the 3rd link whose text == required_player_name
    # (here it is 0: 'Sergio Rodriguez Febles' is not an exact match, so the NBA player comes first)
    print(index_of_player_to_get)
    
    >>>
    0
    

    Now we can read all the links and, among those whose text is Sergio Rodriguez, pull out the one at position index_of_player_to_get.

    
    url = 'https://basketball.realgm.com/search?q=Sergio+Rodriguez'
    response = requests.get(url)
    
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # get all links that carry an href
    all_links = soup.find_all('a', href=True)
    
    link_idx = -1
    for link in all_links:
        if link.text == required_player_name:
            # player name found, increment link_idx
            link_idx += 1
            if link_idx == index_of_player_to_get:
                print(link['href'])
                break  # the matching link was found, no need to scan further
    
    >>>
    /player/Sergio-Rodriguez/Summary/85
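
The href printed above is relative, so it has to be joined with the site root before the player page can be requested, and the search URL has to be rebuilt for each name when several players are processed. A minimal sketch using only the standard library follows; `BASE_URL`, `search_url` and `player_url` are hypothetical helper names, not part of the original answer.

```python
from urllib.parse import quote_plus, urljoin

# Site root taken from the URLs in the question.
BASE_URL = "https://basketball.realgm.com"


def search_url(name):
    """Build the search URL for a player name (spaces are encoded as '+')."""
    return f"{BASE_URL}/search?q={quote_plus(name)}"


def player_url(href):
    """Turn a relative href from the results table into an absolute URL."""
    return urljoin(BASE_URL, href)


if __name__ == "__main__":
    # One search URL per player, instead of a hard-coded one.
    for name in ["Carlos Delfino", "Sergio Rodriguez"]:
        print(search_url(name))
    # Absolute URL of the player page found above.
    print(player_url("/player/Sergio-Rodriguez/Summary/85"))
```

The absolute URL can then be passed to `requests.get`, and the resulting page parsed with the same `h2` / `findNext('table')` logic as in the question.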
    

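    【Comments】:

    • You may have to install the html5lib library for this to work...
    • Any idea how to actually go into the player's individual page and scrape the two tables I need? Also, how would I generalize the web_data line, since I need this for several players?
    • Take a look at the updated answer. It will help you get to the player's individual page.
    【Solution 2】:

    Since you know that your rel is always in the eighth column of the table, you can do this: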

    soup = BeautifulSoup(html, 'html.parser')
    
    # keep only data rows: header rows have no <td> cells, so indexing [7] on them would fail
    rows = [row for row in soup.find_all('tr') if len(row.find_all('td')) >= 8]
    
    eighth_text = [row.find_all('td')[7].text.strip() for row in rows]  # text of the eighth column
    idx = [n for n, text in enumerate(eighth_text) if text != '-']  # indices of rows with text (NBA players)
    

    Then you can access that player (or those players) with:

    for i in idx:
        print(rows[i].a)
    

    or whatever attribute you are looking for. There are probably more Pythonic ways to do this, but I prioritized ease of understanding.
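
As a self-contained illustration of the same idea, the sketch below filters on the `rel` attribute of the eighth cell (rather than its text) and returns the absolute link of each matching player. It assumes BeautifulSoup is installed and runs on a trimmed copy of the question's HTML with the class attributes and inner links of the middle cells dropped; `nba_player_links` and `BASE_URL` are hypothetical names.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://basketball.realgm.com"

# Trimmed copy of the three result rows from the question (8 cells per row).
HTML = """
<table><tbody>
<tr>
<td rel="Rodriguez Febles, Sergio"><a href="/player/Sergio-Rodriguez-Febles/Summary/50443">Sergio Rodriguez Febles</a></td>
<td rel="5">SF</td><td rel="79">6-7</td><td rel="202">202</td>
<td rel="19931018">Oct 18, 1993</td><td rel="2015">2015</td>
<td rel="N/A">-</td><td rel="-">-</td>
</tr>
<tr>
<td rel="Rodriguez, Sergio"><a href="/player/Sergio-Rodriguez/Summary/85">Sergio Rodriguez</a></td>
<td rel="1">PG</td><td rel="75">6-3</td><td rel="176">176</td>
<td rel="19860612">Jun 12, 1986</td><td rel="2006">2006</td>
<td rel="N/A">-</td><td rel="NYK, PHL, POR, SAC">NYK, PHL, POR, SAC</td>
</tr>
<tr>
<td rel="Rodriguez, Sergio"><a href="/player/Sergio-Rodriguez/Summary/39601">Sergio Rodriguez</a></td>
<td rel="3">SG</td><td rel="76">6-4</td><td rel="-">-</td>
<td rel="19771012">Oct 12, 1977</td><td rel="1999">1999</td>
<td rel="N/A">-</td><td rel="-">-</td>
</tr>
</tbody></table>
"""


def nba_player_links(html, base_url=BASE_URL):
    """Return (name, absolute URL) for each row whose eighth cell has rel != '-'."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) < 8:
            continue  # skip header rows and rows from unrelated tables
        if cells[7].get("rel", "-") == "-":
            continue  # no NBA teams listed for this player
        anchor = cells[0].find("a", href=True)
        if anchor is not None:
            links.append((anchor.get_text(strip=True), urljoin(base_url, anchor["href"])))
    return links


if __name__ == "__main__":
    for name, url in nba_player_links(HTML):
        print(name, url)
```

Each returned URL can be fetched with `requests.get` and parsed like any single-player page, which is the step the question's pseudocode leaves open.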

    【Comments】:

    • I'm trying to actually go into each player's individual page and scrape the tables. Any ideas, and where would that code go?