【Question title】: Pandas read_html unable to read tables
【Posted】: 2021-05-16 13:44:33
【Question description】:

I am using the following code:

import requests, pandas as pd
from bs4 import BeautifulSoup

if __name__ == '__main__':
    url = 'https://www.har.com/homedetail/6408-burgoyne-rd-157-houston-tx-77057/3380601'
    list_of_dataframes = pd.read_html(url)

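However, list_of_dataframes does not contain the school information shown at the bottom of the page at the URL above.

I would like to know how to get that information into a DataFrame like the following: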

School                         Stars  Rating
BRIARGROVE Elementary School   4      Good
TANGLEWOOD Middle School       4      Good
WISDOM High School             3      Average

TIA

【Comments】:

    Tags: pandas beautifulsoup python-3.8


    【Solution 1】:

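    You cannot get that school information with pandas alone, because it is not a table. pd.read_html only parses <table> elements, and the school listing is built from plain divs, so you have to parse the HTML yourself and then load the data into a pd.DataFrame.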
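
To illustrate the point, here is a minimal sketch that runs pd.read_html on two small, made-up HTML snippets (not taken from the real page): one containing a real <table> and one that only looks tabular but is built from divs. The helper name count_tables and both snippets are assumptions for illustration, and flavor="lxml" assumes lxml is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical markup: a real <table> ...
TABLE_HTML = """
<table>
  <tr><th>School</th><th>Rating</th></tr>
  <tr><td>Example Elementary</td><td>Good</td></tr>
</table>
"""

# ... and a div-based layout that only looks like a table in the browser.
DIV_HTML = """
<div id="SCHOOLS">
  <div class="border_row"><a href="#">Example Elementary</a> Good</div>
</div>
"""


def count_tables(html: str) -> int:
    """Return how many tables pd.read_html finds in html (0 if none)."""
    try:
        return len(pd.read_html(StringIO(html), flavor="lxml"))
    except ValueError:
        # pandas raises ValueError when the document has no <table> element.
        return 0


if __name__ == "__main__":
    print(count_tables(TABLE_HTML))
    print(count_tables(DIV_HTML))
```

The div-based snippet yields no DataFrame at all, which matches the behaviour described in the question.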

    Here is how to do it:

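    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    
    if __name__ == '__main__':
        url = 'https://www.har.com/homedetail/6408-burgoyne-rd-157-houston-tx-77057/3380601'
        # Fetch the page and narrow the search to the schools container.
        soup = BeautifulSoup(requests.get(url).text, "lxml").find("div", {"id": "SCHOOLS"})
        # Each school is rendered as one div with the class "border_row".
        schools = soup.find_all("div", class_="border_row")
        schools_data = []
        for school in schools:
            # The school name is the text of the first link in the row.
            name = school.find("a").getText()
            # Count the star images; .get() avoids a KeyError on an <img> without src.
            stars = len([i for i in school.find_all("img") if "star" in i.get("src", "")])
            # The rating is the second-to-last whitespace-separated token of the row text.
            rating = school.getText().split()[-2]
            schools_data.append(
                [
                    name,
                    stars,
                    rating,
                ]
            )
        print(pd.DataFrame(schools_data, columns=["School", "Stars", "Rating"]))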
    

    Output:

                             School  Stars   Rating
    0  BRIARGROVE Elementary School      4     Good
    1      TANGLEWOOD Middle School      4     Good
    2            WISDOM High School      3  Average
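
If the site changes its markup or the request is blocked, the .find() call above returns None and the script fails with an AttributeError. The sketch below shows one possible way to wrap the same parsing steps in a function that can be tested offline. The function name parse_schools and the sample markup are hypothetical: the sample only mimics the selectors used in the answer (div#SCHOOLS, div.border_row, star images) and is not copied from the real page. It also assumes the rating is the last token of the row text, whereas the answer uses [-2] for the live page, so the index has to match the actual layout.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample that imitates the structure the answer relies on.
SAMPLE_HTML = """
<div id="SCHOOLS">
  <div class="border_row">
    <a href="#">BRIARGROVE Elementary School</a>
    <img src="/img/star.png"><img src="/img/star.png">
    <img src="/img/star.png"><img src="/img/star.png">
    <img src="/img/blank.png">
    <span>Good</span>
  </div>
  <div class="border_row">
    <a href="#">WISDOM High School</a>
    <img src="/img/star.png"><img src="/img/star.png">
    <img src="/img/star.png"><img src="/img/blank.png">
    <img src="/img/blank.png">
    <span>Average</span>
  </div>
</div>
"""


def parse_schools(html: str) -> pd.DataFrame:
    """Parse div-based school rows into a DataFrame (empty if none found)."""
    columns = ["School", "Stars", "Rating"]
    # html.parser ships with Python, so no extra parser is needed here.
    container = BeautifulSoup(html, "html.parser").find("div", {"id": "SCHOOLS"})
    if container is None:
        # Container missing: return an empty frame instead of raising.
        return pd.DataFrame(columns=columns)

    rows = []
    for row in container.find_all("div", class_="border_row"):
        link = row.find("a")
        if link is None:
            continue  # skip rows without a school link
        # Count only images whose src mentions "star".
        stars = sum("star" in img.get("src", "") for img in row.find_all("img"))
        # Assumption: the rating is the last token of the row text in this sample.
        rating = row.get_text().split()[-1]
        rows.append([link.get_text(strip=True), stars, rating])
    return pd.DataFrame(rows, columns=columns)


if __name__ == "__main__":
    print(parse_schools(SAMPLE_HTML))
```

With the live page, the same function could be called as parse_schools(requests.get(url).text) after adjusting the rating index to the real layout.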
    

    【Discussion】:
