【Question title】: Pandas - dealing with empty cells
【Posted】: 2018-02-18 06:49:17
【Question】:

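I am having a great deal of trouble scraping football player details with BeautifulSoup into a workable pandas table.

The problem is that some of the data I scrape is "extra" and fills my table rows with rubbish. For example: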

import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:20.0) Gecko/20100101 Firefox/20.0"}

page = requests.get('https://www.transfermarkt.co.uk/manchester-united/startseite/verein/985', headers=HEADERS)
soup = BeautifulSoup(page.content, 'html.parser')

playerdata = soup.find_all(class_='posrela')
names = [';'.join(pt.findAll(text=True)) for pt in playerdata]

df = pd.DataFrame(names)
df = pd.DataFrame([sub.split(";") for sub in names])

print(df.replace('^$', np.nan, regex=True))

Result:

 python testing5.py
                     0               1                   2                   3
0         David de Gea       D. de Gea              Keeper                None
1        Sergio Romero       S. Romero              Keeper                None
2         Joel Pereira      J. Pereira              Keeper                None
3          Eric Bailly       E. Bailly                             Centre-Back
4      Victor Lindelöf     V. Lindelöf         Centre-Back                None
5          Marcos Rojo         M. Rojo                             Centre-Back
6       Chris Smalling     C. Smalling         Centre-Back                None
7           Phil Jones        P. Jones                             Centre-Back
8          Daley Blind        D. Blind           Left-Back                None
9            Luke Shaw       Luke Shaw           Left-Back                None
10      Matteo Darmian      M. Darmian          Right-Back                None
11    Antonio Valencia     A. Valencia          Right-Back                None
12       Nemanja Matic        N. Matic  Defensive Midfield                None
13     Michael Carrick      M. Carrick                      Defensive Midfield
14          Paul Pogba        P. Pogba    Central Midfield                None
15       Ander Herrera      A. Herrera    Central Midfield                None
16   Marouane Fellaini     M. Fellaini    Central Midfield                None
17        Ashley Young        A. Young       Left Midfield                None
18  Henrikh Mkhitaryan   H. Mkhitaryan  Attacking Midfield                None
19           Juan Mata       Juan Mata  Attacking Midfield                None
20       Jesse Lingard      J. Lingard           Left Wing                None
21       Romelu Lukaku       R. Lukaku      Centre-Forward                None
22     Anthony Martial      A. Martial                   .      Centre-Forward
23     Marcus Rashford     M. Rashford      Centre-Forward                None
24  Zlatan Ibrahimovic  Z. Ibrahimovic                          Centre-Forward

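As you can see, wherever I scrape empty data, the real value gets pushed into the wrong cell. You may ask why I have a fourth column at all: I will be inserting more data there later, but for now I need to clean up column 3.

As you can see, I first tried using a regular expression to replace the blanks with NaN. But whatever I try, I cannot seem to "select" the empty cells. I just can't reach them!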
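
One possible reason the `'^$'` pattern finds nothing is that the "empty" cells are not empty strings at all but hold whitespace (a space, a newline, or a non-breaking space). That is an assumption, since the raw cell contents are not shown here. A minimal sketch on a small hand-made frame (the rows below are hypothetical stand-ins for the scraped data) that matches whitespace-only cells instead:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the "blank" cells hold whitespace, not empty strings.
df = pd.DataFrame([
    ["David de Gea", "D. de Gea", "Keeper", None],
    ["Eric Bailly", "E. Bailly", " ", "Centre-Back"],
    ["Marcos Rojo", "M. Rojo", "\xa0", "Centre-Back"],
])

# r"^\s*$" matches empty strings as well as strings made only of whitespace.
cleaned = df.replace(r"^\s*$", np.nan, regex=True)
print(cleaned)
```

Even once the blanks are NaN, the position is still sitting in the wrong column, so the values still have to be moved across afterwards.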

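When I tried treating "names" as a list, the interpreter told me it is not a list but a ResultSet!

I am wondering if anyone can help. As a programming noob I have made good progress, but I have hit a wall.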

【Comments】:

    Tags: python pandas beautifulsoup


    【Solution 1】:

    You can post-process the frame: use loc with notnull to copy the non-NaN values of column 3 into column 2:

    df.loc[df[3].notnull(), 2] = df[3]
    #remove column 3
    df = df.drop(3, axis=1)
    

    Another solution is mask:

    df[2] = df[2].mask(df[3].notnull(), df[3])
    df = df.drop(3, axis=1)
    

    Or, very similarly, with numpy.where:

    df[2] = np.where(df[3].notnull(), df[3], df[2])
    df = df.drop(3, axis=1)
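
The three variants above should give the same result. A small sketch to check that on a hand-made frame (the rows and the `make_df` helper are hypothetical, chosen only to mimic the shape of the scraped table):

```python
import numpy as np
import pandas as pd

def make_df():
    # Hypothetical sample: column 3 is filled only where column 2 holds rubbish.
    return pd.DataFrame([
        ["David de Gea", "D. de Gea", "Keeper", None],
        ["Eric Bailly", "E. Bailly", " ", "Centre-Back"],
        ["Anthony Martial", "A. Martial", ".", "Centre-Forward"],
    ])

# Variant 1: boolean indexing with loc
a = make_df()
a.loc[a[3].notnull(), 2] = a[3]

# Variant 2: Series.mask replaces values where the condition is True
b = make_df()
b[2] = b[2].mask(b[3].notnull(), b[3])

# Variant 3: numpy.where picks from column 3 where it is set, else column 2
c = make_df()
c[2] = np.where(c[3].notnull(), c[3], c[2])

# Drop the helper column in every variant
results = [frame.drop(3, axis=1) for frame in (a, b, c)]
print(results[0])
```

All three overwrite column 2 only in the rows where column 3 has a value, so rows that were already correct are left untouched.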
    

    I tried to improve your solution a little:

    playerdata = soup.find_all(class_='posrela')
    names = [list(pt.findAll(text=True)) for pt in playerdata]
    df = pd.DataFrame(names)
    df.loc[df[3].notnull(), 2] = df[3]
    df = df.drop(3, axis=1)
    print (df)
    
                         0               1                   2
    0         David de Gea       D. de Gea              Keeper
    1        Sergio Romero       S. Romero              Keeper
    2         Joel Pereira      J. Pereira              Keeper
    3          Eric Bailly       E. Bailly         Centre-Back
    4      Victor Lindelöf     V. Lindelöf         Centre-Back
    5          Marcos Rojo         M. Rojo         Centre-Back
    6       Chris Smalling     C. Smalling         Centre-Back
    7           Phil Jones        P. Jones         Centre-Back
    8          Daley Blind        D. Blind           Left-Back
    9            Luke Shaw       Luke Shaw           Left-Back
    10      Matteo Darmian      M. Darmian          Right-Back
    11    Antonio Valencia     A. Valencia          Right-Back
    12       Nemanja Matic        N. Matic  Defensive Midfield
    13     Michael Carrick      M. Carrick  Defensive Midfield
    14          Paul Pogba        P. Pogba    Central Midfield
    15       Ander Herrera      A. Herrera    Central Midfield
    16   Marouane Fellaini     M. Fellaini    Central Midfield
    17        Ashley Young        A. Young       Left Midfield
    18  Henrikh Mkhitaryan   H. Mkhitaryan  Attacking Midfield
    19           Juan Mata       Juan Mata  Attacking Midfield
    20       Jesse Lingard      J. Lingard           Left Wing
    21       Romelu Lukaku       R. Lukaku      Centre-Forward
    22     Anthony Martial      A. Martial      Centre-Forward
    23     Marcus Rashford     M. Rashford      Centre-Forward
    24  Zlatan Ibrahimovic  Z. Ibrahimovic      Centre-Forward
    

    Another solution:

    playerdata = soup.find_all(class_='posrela')
    
    names = []
    for pt in playerdata:
        L = list(pt.findAll(text=True))
        # check the length of the list
        if len(L) == 4:
            # a 4th value means the 3rd is rubbish: overwrite it with the 4th
            L[2] = L[3]
        # append the first 3 values of the list
        names.append(L[:3])
    
    df = pd.DataFrame(names)
    

    print (df)
                         0               1                   2
    0         David de Gea       D. de Gea              Keeper
    1        Sergio Romero       S. Romero              Keeper
    2         Joel Pereira      J. Pereira              Keeper
    3          Eric Bailly       E. Bailly         Centre-Back
    4      Victor Lindelöf     V. Lindelöf         Centre-Back
    5          Marcos Rojo         M. Rojo         Centre-Back
    6       Chris Smalling     C. Smalling         Centre-Back
    7           Phil Jones        P. Jones         Centre-Back
    8          Daley Blind        D. Blind           Left-Back
    9            Luke Shaw       Luke Shaw           Left-Back
    10      Matteo Darmian      M. Darmian          Right-Back
    11    Antonio Valencia     A. Valencia          Right-Back
    12       Nemanja Matic        N. Matic  Defensive Midfield
    13     Michael Carrick      M. Carrick  Defensive Midfield
    14          Paul Pogba        P. Pogba    Central Midfield
    15       Ander Herrera      A. Herrera    Central Midfield
    16   Marouane Fellaini     M. Fellaini    Central Midfield
    17        Ashley Young        A. Young       Left Midfield
    18  Henrikh Mkhitaryan   H. Mkhitaryan  Attacking Midfield
    19           Juan Mata       Juan Mata  Attacking Midfield
    20       Jesse Lingard      J. Lingard           Left Wing
    21       Romelu Lukaku       R. Lukaku      Centre-Forward
    22     Anthony Martial      A. Martial      Centre-Forward
    23     Marcus Rashford     M. Rashford      Centre-Forward
    24  Zlatan Ibrahimovic  Z. Ibrahimovic      Centre-Forward
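
A slightly more compact variant of the same loop, sketched under the assumption that the position is always the last text fragment of each entry (which holds for the rows shown above, but is not guaranteed by the page). The `pick_fields` helper and the sample lists are hypothetical:

```python
import pandas as pd

def pick_fields(texts):
    """Return [name, short name, position], taking the position from the end."""
    texts = list(texts)
    # The first two fragments are the names; the last one is the position,
    # whether the entry has 3 fragments or 4.
    return [texts[0], texts[1], texts[-1]]

# Hypothetical text fragments, shaped like the output of findAll(text=True)
scraped = [
    ["David de Gea", "D. de Gea", "Keeper"],
    ["Eric Bailly", "E. Bailly", " ", "Centre-Back"],
    ["Anthony Martial", "A. Martial", ".", "Centre-Forward"],
]

names = [pick_fields(texts) for texts in scraped]
df = pd.DataFrame(names)
print(df)
```

This avoids the explicit length check, at the cost of silently accepting entries with more than four fragments.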
    

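    【Comments】:

    • Thanks - that works perfectly. Now to get my head around it. I have used loc before to pick out "cells", but the rest needs some thought. **Just noticed your edit, thanks again.
    • Glad I could help. I also improved your solution a little and added another one. Have a nice weekend!
    【Solution 2】:

    If you are going to extract more data, I suggest extracting all of it in an order that is easy to put into a DataFrame. Unless you extract the data in the right shape, you will keep having to run unnecessary cleaning operations.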

    playerdata = soup.find_all(class_='inline-table')
    
    names = [[x.find('img')['title'],
             x.find_all(class_='spielprofil_tooltip')[-1].renderContents(),
             x.find_all('tr')[-1].find('td').renderContents()] for x in playerdata]
    
    df = pd.DataFrame(names,columns=['Name','Short','Position'])
    
    
                      Name            Short            Position
    0         David de Gea        D. de Gea              Keeper
    1        Sergio Romero        S. Romero              Keeper
    2         Joel Pereira       J. Pereira              Keeper
    3          Eric Bailly        E. Bailly         Centre-Back
    4      Victor Lindelöf      V. Lindelöf         Centre-Back
    5          Marcos Rojo          M. Rojo         Centre-Back
    6       Chris Smalling      C. Smalling         Centre-Back
    7           Phil Jones         P. Jones         Centre-Back
    8          Daley Blind         D. Blind           Left-Back
    9            Luke Shaw        Luke Shaw           Left-Back
    10      Matteo Darmian       M. Darmian          Right-Back
    11    Antonio Valencia      A. Valencia          Right-Back
    12       Nemanja Matic         N. Matic  Defensive Midfield
    13     Michael Carrick       M. Carrick  Defensive Midfield
    14          Paul Pogba         P. Pogba    Central Midfield
    15       Ander Herrera       A. Herrera    Central Midfield
    16   Marouane Fellaini      M. Fellaini    Central Midfield
    17        Ashley Young         A. Young       Left Midfield
    18  Henrikh Mkhitaryan    H. Mkhitaryan  Attacking Midfield
    19           Juan Mata        Juan Mata  Attacking Midfield
    20       Jesse Lingard       J. Lingard           Left Wing
    21       Romelu Lukaku        R. Lukaku      Centre-Forward
    22     Anthony Martial       A. Martial      Centre-Forward
    23     Marcus Rashford      M. Rashford      Centre-Forward
    24  Zlatan Ibrahimovic   Z. Ibrahimovic      Centre-Forward
    25       Romelu Lukaku    Romelu Lukaku      Centre-Forward
    26          Paul Pogba       Paul Pogba    Central Midfield
    27     Anthony Martial  Anthony Martial      Centre-Forward
    28     Marcus Rashford  Marcus Rashford      Centre-Forward
    29         Eric Bailly      Eric Bailly         Centre-Back
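
One thing to watch with this approach: bs4 returns the encoded contents of a tag as a `bytes` object, which prints with a `b'...'` prefix. A sketch of the same extraction using `get_text(strip=True)`, which returns `str` instead; the HTML fragment below is hypothetical and only loosely modelled on the page structure assumed above:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical fragment: one small table per player.
html = """
<table class="inline-table">
  <tr><td><img title="David de Gea"></td><td><a class="spielprofil_tooltip">D. de Gea</a></td></tr>
  <tr><td>Keeper</td></tr>
</table>
<table class="inline-table">
  <tr><td><img title="Eric Bailly"></td><td><a class="spielprofil_tooltip">E. Bailly</a></td></tr>
  <tr><td>Centre-Back</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for table in soup.find_all(class_="inline-table"):
    name = table.find("img")["title"]
    # get_text returns str, so no bytes prefix ends up in the frame
    short = table.find_all(class_="spielprofil_tooltip")[-1].get_text(strip=True)
    position = table.find_all("tr")[-1].find("td").get_text(strip=True)
    rows.append([name, short, position])

df = pd.DataFrame(rows, columns=["Name", "Short", "Position"])
print(df)

# For comparison: the encoded contents of the last cell are bytes
raw = soup.find_all("tr")[-1].find("td").encode_contents()
print(type(raw), raw.decode("utf-8"))
```

If bytes values are already in a list, calling `.decode("utf-8")` on each one converts them to ordinary strings.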
    

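    【Comments】:

    • Great answer. I did struggle with BeautifulSoup (as you correctly suggest) to get the right source data in the first place. Clearly I didn't do it very well! I'm not very good at selecting. However, I suspect you are 100% right that getting the correct source first is a more efficient way of doing things. Thanks.
    • @charliedontsurf, glad to help! I'm sure you will have to scrape more data. If you use Chrome, I like to right-click and choose Inspect; it is the best tool for web scraping. You can work your way down the tree and highlight the matching location on the page. Then try filtering the elements with bs4 one at a time :)
    • I have one more question (I have made good progress!): some of the items I scrape sometimes come back with a b' prefix and sometimes as [b' ... ]. I don't know why that is! I would like to clean the results before continuing. I think it has something to do with the type of data I am scraping...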