【Posted】: 2021-02-20 03:50:12
【Question】:
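I am trying to scrape information on schools in Punjab from this website, https://schoolportal.punjab.gov.pk/sed_census/, by going through the different districts listed on its home page. (For example, for Rawalpindi district, the page I am scraping is https://schoolportal.punjab.gov.pk/sed_census/new_emis_details.aspx?distId=373--Rawalpindi)
The goal is to create a dataframe containing (at least) the columns school_name, school_gender, school_level and location.
Running the following: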
import requests
from bs4 import BeautifulSoup

r = requests.get('https://schoolportal.punjab.gov.pk/sed_census/new_emis_details.aspx?distId=373--Rawalpindi')
soup = BeautifulSoup(r.text, 'html.parser')
# Collect every <font> element in either of the two row colors
soup.find_all('font', {'color':['#333333', '#284775']})[36:]
returns each individual cell of the table on the website, rather than the rows:
[<font color="#333333"><a href="list_of_emis_detail.aspx?emiscode=37350153">37350153</a></font>,
<font color="#333333">GGPS BADNIAN</font>,
<font color="#333333">Female</font>,
<font color="#333333">Primary</font>,
<font color="#333333">Badnian</font>,
<font color="#333333"><a href="http://maps.google.com/?ie=UTF8&q=GGPS BADNIAN@33.47595,73.328" target="_blank"><img height="70" src="images/mapsingle.jpg"/></a></font>,
<font color="#333333"><a href="sch_surrounding.aspx?mauza=Badnian&distid=373"><img height="70" src="images/mapsmulti.jpg"/></a></font>,
<font color="#284775"><a href="list_of_emis_detail.aspx?emiscode=37320269">37320269</a></font>,
<font color="#284775">GGPS JANDALA</font>,
<font color="#284775">Female</font>,
<font color="#284775">Primary</font>,
<font color="#284775">Potha Sharif</font>,
<font color="#284775"><a href="http://maps.google.com/?ie=UTF8&q=GGPS JANDALA@33.95502,73.50301" target="_blank"><img height="70" src="images/mapsingle.jpg"/></a></font>,
<font color="#284775"><a href="sch_surrounding.aspx?mauza=Potha Sharif&distid=373"><img height="70" src="images/mapsmulti.jpg"/></a></font>,
<font color="#333333"><a href="list_of_emis_detail.aspx?emiscode=37310001">37310001</a></font>,
<font color="#333333">GHSS NARA</font>,
<font color="#333333">Male</font>,
<font color="#333333">H.Sec.</font>,
<font color="#333333">Nara</font>,
<font color="#333333"><a href="http://maps.google.com/?ie=UTF8&q=GHSS NARA@33.5401766980066,73.5258855577558" target="_blank"><img height="70" src="images/mapsingle.jpg"/></a></font>,
<font color="#333333"><a href="sch_surrounding.aspx?mauza=Nara&distid=373"><img height="70" src="images/mapsmulti.jpg"/></a></font>,
<font color="#284775"><a href="list_of_emis_detail.aspx?emiscode=37310003">37310003</a></font>,
<font color="#284775">GHS HANESAR</font>,
<font color="#284775">Male</font>,
.....
etc...
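So the first seven elements correspond to the first row of the table, the next seven to the second row, and so on.
I am stuck on how to create the dataframe in a clean, elegant way. I have considered splitting the list into chunks of 7 elements (following "How to group elements in python by n elements?"), but I am wondering whether there is a more accurate and efficient approach.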
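To make the two options concrete, here is a minimal sketch that compares chunking the flat cell list into groups of 7 with walking the table row by row. It runs against a small hand-built HTML fragment that imitates the cells shown above, so the `<table>`/`<tr>`/`<td>` layout, the `make_row` helper, and the `emis_code` column name are all assumptions for illustration; the real page's markup may differ.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Sample values copied from the output above; the surrounding table markup
# is an assumption about how the real page is structured.
ROWS = [
    ("#333333", "37350153", "GGPS BADNIAN", "Female", "Primary", "Badnian"),
    ("#284775", "37320269", "GGPS JANDALA", "Female", "Primary", "Potha Sharif"),
]


def make_row(color, emis, name, gender, level, location):
    """Hypothetical helper: build one 7-cell <tr> like those on the page."""
    cells = [
        f'<a href="list_of_emis_detail.aspx?emiscode={emis}">{emis}</a>',
        name, gender, level, location,
        '<a href="#"><img src="images/mapsingle.jpg"/></a>',
        '<a href="#"><img src="images/mapsmulti.jpg"/></a>',
    ]
    tds = "".join(f'<td><font color="{color}">{c}</font></td>' for c in cells)
    return f"<tr>{tds}</tr>"


SAMPLE_HTML = "<table>" + "".join(make_row(*r) for r in ROWS) + "</table>"

# "emis_code" is an illustrative name; the other four are the target columns.
COLUMNS = ["emis_code", "school_name", "school_gender", "school_level", "location"]
CELLS_PER_ROW = 7

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Option 1: chunk the flat list of cells into groups of 7.
cells = soup.find_all("font", {"color": ["#333333", "#284775"]})
texts = [c.get_text(strip=True) for c in cells]
chunks = [texts[i:i + CELLS_PER_ROW] for i in range(0, len(texts), CELLS_PER_ROW)]
# Keep the first five text cells; the last two only hold map-link images.
df_chunked = pd.DataFrame([chunk[:5] for chunk in chunks], columns=COLUMNS)

# Option 2: iterate over <tr> elements so each record stays tied to its row.
records = []
for tr in soup.find_all("tr"):
    tds = tr.find_all("td")
    if len(tds) != CELLS_PER_ROW:
        continue  # skip header, pager, or layout rows with a different shape
    records.append([td.get_text(strip=True) for td in tds[:5]])
df_rows = pd.DataFrame(records, columns=COLUMNS)

print(df_rows)
```

The row-wise variant does not depend on a fixed offset such as `[36:]` or on every row having exactly seven matching `font` elements, so it may be more robust if the page contains extra or missing cells. On the real page, the `tr` search would likely need to be narrowed to the specific table that holds the school list.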
【Discussion】:
Tags: python html regex dataframe beautifulsoup