【Question Title】: Getting a dataframe using Beautifulsoup python
【Posted】: 2021-02-20 03:50:12
【Question】:

I am trying to scrape information on schools in Punjab from this site https://schoolportal.punjab.gov.pk/sed_census/ by navigating through the different districts offered on the main page. (For example, for the Rawalpindi district, the HTML I am scraping is: https://schoolportal.punjab.gov.pk/sed_census/new_emis_details.aspx?distId=373--Rawalpindi)

The goal is to create a dataframe with (at least) the columns school_name, school_gender, school_level, and location.

Running the following:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://schoolportal.punjab.gov.pk/sed_census/new_emis_details.aspx?distId=373--Rawalpindi')
soup = BeautifulSoup(r.text, 'html.parser')
soup.find_all('font', {'color':['#333333', '#284775']})[36:]

returns each cell of the table on the website, i.e. it returns

[<font color="#333333"><a href="list_of_emis_detail.aspx?emiscode=37350153">37350153</a></font>,
 <font color="#333333">GGPS BADNIAN</font>,
 <font color="#333333">Female</font>,
 <font color="#333333">Primary</font>,
 <font color="#333333">Badnian</font>,
 <font color="#333333"><a href="http://maps.google.com/?ie=UTF8&amp;q=GGPS BADNIAN@33.47595,73.328" target="_blank"><img height="70" src="images/mapsingle.jpg"/></a></font>,
 <font color="#333333"><a href="sch_surrounding.aspx?mauza=Badnian&amp;distid=373"><img height="70" src="images/mapsmulti.jpg"/></a></font>,
 <font color="#284775"><a href="list_of_emis_detail.aspx?emiscode=37320269">37320269</a></font>,
 <font color="#284775">GGPS JANDALA</font>,
 <font color="#284775">Female</font>,
 <font color="#284775">Primary</font>,
 <font color="#284775">Potha Sharif</font>,
 <font color="#284775"><a href="http://maps.google.com/?ie=UTF8&amp;q=GGPS JANDALA@33.95502,73.50301" target="_blank"><img height="70" src="images/mapsingle.jpg"/></a></font>,
 <font color="#284775"><a href="sch_surrounding.aspx?mauza=Potha Sharif&amp;distid=373"><img height="70" src="images/mapsmulti.jpg"/></a></font>,
 <font color="#333333"><a href="list_of_emis_detail.aspx?emiscode=37310001">37310001</a></font>,
 <font color="#333333">GHSS NARA</font>,
 <font color="#333333">Male</font>,
 <font color="#333333">H.Sec.</font>,
 <font color="#333333">Nara</font>,
 <font color="#333333"><a href="http://maps.google.com/?ie=UTF8&amp;q=GHSS NARA@33.5401766980066,73.5258855577558" target="_blank"><img height="70" src="images/mapsingle.jpg"/></a></font>,
 <font color="#333333"><a href="sch_surrounding.aspx?mauza=Nara&amp;distid=373"><img height="70" src="images/mapsmulti.jpg"/></a></font>,
 <font color="#284775"><a href="list_of_emis_detail.aspx?emiscode=37310003">37310003</a></font>,
 <font color="#284775">GHS HANESAR</font>,
 <font color="#284775">Male</font>,
.....
etc... 

So the first seven elements correspond to the first school, the next seven to the second, and so on.

I am stuck on how to create the dataframe in a clean, elegant way.

I have considered splitting them into groups of 7 elements (as per How to group elements in python by n elements?), but I wonder whether there is a more accurate and efficient way.
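For reference, the grouping idea can be sketched as below. This is a minimal illustration against a made-up one-school snippet that mirrors the structure returned above (the real page has many rows); only the first five cells of each group of seven carry text, the last two are the map-image links:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Invented one-school snippet matching the <font> structure shown above.
html = """
<table>
<tr><td><font color="#333333"><a href="list_of_emis_detail.aspx?emiscode=37350153">37350153</a></font></td>
<td><font color="#333333">GGPS BADNIAN</font></td>
<td><font color="#333333">Female</font></td>
<td><font color="#333333">Primary</font></td>
<td><font color="#333333">Badnian</font></td>
<td><font color="#333333"><a href="#"><img/></a></font></td>
<td><font color="#333333"><a href="#"><img/></a></font></td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')
cells = soup.find_all('font', {'color': ['#333333', '#284775']})
texts = [c.get_text(strip=True) for c in cells]

# Group the flat list into chunks of 7 cells (one school each) and keep
# only the first five columns, which hold the text data.
rows = [texts[i:i + 7][:5] for i in range(0, len(texts), 7)]
df = pd.DataFrame(rows, columns=['emiscode', 'school_name',
                                 'school_gender', 'school_level', 'moza'])
```

`df` then has one row per school; the two map-link cells are dropped because their text is empty anyway.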

【Comments】:

    Tags: python html regex dataframe beautifulsoup


    【Solution 1】:
    import requests
    import pandas as pd
    from bs4 import BeautifulSoup
    
    
    url = 'https://schoolportal.punjab.gov.pk/sed_census/'
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    area_urls = ['https://schoolportal.punjab.gov.pk/sed_census/' + href['href'] for href in soup.select('map [href]')]
    
    all_data = []
    for u in area_urls:
        print('Getting data from page {} ...'.format(u))
        soup = BeautifulSoup(requests.get(u).content, 'html.parser')
        district = soup.b.text
    
        for row in soup.select('#main1_grd_emis_details tr:has(td)'):
            tds = [td.get_text(strip=True) for td in row.select('td')]
            all_data.append([district] + tds[:5])
    
    df = pd.DataFrame(all_data, columns='district emiscode school_name school_gender school_level moza'.split())
    df.to_csv('data.csv')
    print(df)
    

    Prints:

    Getting data from page https://schoolportal.punjab.gov.pk/sed_census/new_emis_details.aspx?distId=352--Lahore ...
             district  emiscode                                        school_name school_gender school_level           moza
    0     352--Lahore  35210532                                   GGPS CHINKOWINDI        Female      Primary    CHINKOWINDI
    1     352--Lahore  35210001                    GHSS COMPRESHENSIVE GHORAY SHAH          Male       H.Sec.    Gujjar Pura
    2     352--Lahore  35210002           GGHSS SHEIKH SARDAR MUHAMMAD GARHI SHAHU        Female       H.Sec.         lahore
    3     352--Lahore  35210003                                    GGHSS SAMANABAD        Female       H.Sec.               
    4     352--Lahore  35210004                                        GGHSS BARKI        Female       H.Sec.          Barki
    ...           ...       ...                                                ...           ...          ...            ...
    1213  352--Lahore  35230680  GGPS OUT SIDE BABLIANA (Shifted from Kasur To ...        Female      Primary  NOOR MUHAMMAD
    1214  352--Lahore  35211007                             GGPS CHUNGI AMER SIDHU        Female      Primary              0
    1215  352--Lahore  35250306                               GPS GOPAL SINGH WALA          Male      Primary    GOPAL SINGH
    1216  352--Lahore  35240728                                 GGPS PATTI KASHMIR        Female      Primary  PATTI KASHMIR
    1217  352--Lahore  35230678  GGPS WARA JHANDA SINGH (SHIFTED FROM KASUR TO ...        Female      Primary    WARA JHANDA
    
    [1218 rows x 6 columns]
    
    ...etc.
    

    and saves data.csv (screenshot from LibreOffice):


    Edit: to get the latitude and longitude, you can:

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup
    
    
    url = 'https://schoolportal.punjab.gov.pk/sed_census/'
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    area_urls = ['https://schoolportal.punjab.gov.pk/sed_census/' + href['href'] for href in soup.select('map [href]')]
    
    all_data = []
    for u in area_urls:
        # for this demo the URL is overridden and the loop `break`s after one
        # district; remove both lines to crawl every district
        u = 'https://schoolportal.punjab.gov.pk/sed_census/new_emis_details.aspx?distId=383--Mianwali'
    
        print('Getting data from page {} ...'.format(u))
        soup = BeautifulSoup(requests.get(u).content, 'html.parser')
        district = soup.b.text
    
        for row in soup.select('#main1_grd_emis_details tr:has(td)'):
            tds = [td.get_text(strip=True) for td in row.select('td')]
            a = row.select_one('a[href*="maps.google.com"]')
            # the maps href ends with "@<latitude>,<longitude>"
            lat, lon = a['href'].split('@')[-1].split(',')
            all_data.append([district] + tds[:5] + [lat, lon])
    
        break
    
    df = pd.DataFrame(all_data, columns='district emiscode school_name school_gender school_level moza lat lon'.split())
    df.to_csv('data.csv')
    print(df)
    

    Prints:

               district  emiscode                            school_name school_gender school_level                       moza               lat               lon
    0     383--Mianwali  38310001                         GHSS TABBI SAR          Male       H.Sec.  Poss Bangi Khela Darmiani  33.1439236085861  71.5508843678981
    1     383--Mianwali  38310002                     GHSS KAMAR MUSHANI          Male       H.Sec.                    Sodhari  32.8450116561725  71.3622024469077
    2     383--Mianwali  38310003                           GHS ISA KHEL          Male         High                   Isa Khel  32.6850186428055   71.272792853415
    3     383--Mianwali  38310004                       GHS KHAGLAN WALA          Male         High                khaglanwala  32.6359399594366  71.2692983541637
    4     383--Mianwali  38310005                      GHS KALLOR SHARIF          Male         High                     Kallur  32.7383419219404  71.2667574640363
    ...             ...       ...                                    ...           ...          ...                        ...               ...               ...
    1294  383--Mianwali  38331264                 GPS DERA BALOCHAN WALA          Male      Primary                  Maly wali  32.2964028501883  71.2868203874677
    1295  383--Mianwali  38331267  GES DERA MUHAMMAD NAWAZ SULTANAY WALA          Male       Middle                    Harnoli        32.3521257        71.5292018
    
    ...
    

    【Discussion】:

    • This is great, thank you so much! I am also trying to get the location (latitude and longitude), but it seems we only get text out of `tds = [td.get_text(strip=True) for td in row.select('td')]`. So, for example, from `maps.google.com/?ie=UTF8&q=GHS NIDDOKE@32.1145927906036,74.7351837158203` I would also need to extract 32.1145... and 74.73518... Is there a way to account for that?
    【Solution 2】:

    First you need to use id, class, CSS or XPath selectors (please Google them) to get the page's elements. The reason for this is to avoid brittle locators. In your case, for example, anything with that font color would be selected. But say you instead use this CSS selector:

    #main1_grd_emis_details tr
    

    Now only the record rows of the table are selected. I urge you to Google web element selectors and understand them before going further. Now, if you want the nth element in this table, in JavaScript you could modify the selector above like this, replacing n with a 1-based index.

    #main1_grd_emis_details tr:nth-child(n)
    

    In Beautiful Soup I believe the nth-child selector is nth-of-type(n), so the selector above becomes

    #main1_grd_emis_details tr:nth-of-type(n)
    

    So, for example, the Python code to get the second row would be

    someRow = soup.select_one("#main1_grd_emis_details tr:nth-of-type(2)")
    

    Now, to get each column within a row, you can apply a CSS selector again, perhaps this one (I have not tested it, it may be wrong)

    "td:nth-of-type(n)"
    

    Extract whatever you need from each row, such as text or href (you can Google that too), put it in a dictionary, and then add that dictionary to the dataframe.
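    Putting those pieces together, a minimal sketch of that row-by-row approach might look like this. It is untested against the live site; the table id matches the one used above, but the HTML here is an invented one-row stand-in:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Invented stand-in for the live table (the id matches the real page).
html = """
<table id="main1_grd_emis_details">
<tr><th>emiscode</th><th>name</th><th>gender</th><th>level</th><th>moza</th></tr>
<tr><td>37350153</td><td>GGPS BADNIAN</td><td>Female</td><td>Primary</td><td>Badnian</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

records = []
for row in soup.select('#main1_grd_emis_details tr'):
    tds = row.select('td')
    if len(tds) < 5:          # skip the header row, which only has <th> cells
        continue
    # one dictionary per row, then build the DataFrame once at the end
    records.append({
        'emiscode': tds[0].get_text(strip=True),
        'school_name': tds[1].get_text(strip=True),
        'school_gender': tds[2].get_text(strip=True),
        'school_level': tds[3].get_text(strip=True),
        'moza': tds[4].get_text(strip=True),
    })

df = pd.DataFrame(records)
```

    Collecting all the dictionaries first and constructing the DataFrame once is also cheaper than growing a DataFrame one row at a time.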

    【Discussion】:
