[Question Title]: Python web scraping and saving to a pandas dataframe
[Posted]: 2020-02-02 21:28:06
[Question Description]:

I'm trying to scrape the housing listings on a remax page and save that information to a pandas dataframe, but for some reason it keeps giving me a KeyError. Here's my code:

import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'https://www.remax.ca/ab/calgary-real-estate/720-37-st-nw-wp_id251536557-lst'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
detail_title = soup.find_all(class_='detail-title')
details_t = pd.DataFrame(detail_title)

Here's the error I get:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-6-3be49b8e4cfc> in <module>
      6 soup = BeautifulSoup(response.text, 'html.parser')
      7 detail_title = soup.find_all(class_='detail-title')
----> 8 details_t = pd.DataFrame(detail_title)

~/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    449                 else:
    450                     mgr = init_ndarray(data, index, columns, dtype=dtype,
--> 451                                        copy=copy)
    452             else:
    453                 mgr = init_dict({}, index, columns, dtype=dtype)

~/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py in init_ndarray(values, index, columns, dtype, copy)
    144     # by definition an array here
    145     # the dtypes will be coerced to a single dtype
--> 146     values = prep_ndarray(values, copy=copy)
    147 
    148     if dtype is not None:

~/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py in prep_ndarray(values, copy)
    228         try:
    229             if is_list_like(values[0]) or hasattr(values[0], 'len'):
--> 230                 values = np.array([convert(v) for v in values])
    231             elif isinstance(values[0], np.ndarray) and values[0].ndim == 0:
    232                 # GH#21861

~/anaconda3/lib/python3.7/site-packages/bs4/element.py in __getitem__(self, key)
   1014         """tag[key] returns the value of the 'key' attribute for the tag,
   1015         and throws an exception if it's not there."""
-> 1016         return self.attrs[key]
   1017 
   1018     def __iter__(self):

KeyError: 0

Any help would be appreciated!

[Question Discussion]:

    Tags: python pandas dataframe web-scraping beautifulsoup


    [Solution 1]:

    You can try this. I'm assuming you only want the text inside the <span> tags, but feel free to adapt my working example.

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    url = 'https://www.remax.ca/ab/calgary-real-estate/720-37-st-nw-wp_id251536557-lst'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    detail_title = soup.find_all(class_='detail-title')
    
    ls = []
    
    for tag in detail_title:
        ls.append(tag.text)
    
    df = pd.DataFrame(data=ls)
    
    print(df)
    

    输出

                               0
    0            Property Type:
    1             Property Tax:
    2             Last Updated:
    3        Property Sub Type:
    4                  MLS® #:
    5           Ownership-Type:
    6               Year Built:
    7                     sqft:
    8              Date Listed:
    9                 Lot Size:
    10               Occupancy:
    11             Subdivision:
    12                 Heating:
    13          Heating Source:
    14          Full Bathrooms:
    15          Half Bathrooms:
    16                   Rooms:
    17                Basement:
    18    Basement Development:
    19                Flooring:
    20          Parking Spaces:
    21                 Parking:
    22                    Area:
    23                Exterior:
    24              Foundation:
    25                    Roof:
    26                   Faces:
    27  Miscellaneous Features:
    28         Lot Description:
    29                   Condo:
    30                Board ID:
    31                   Suite:
    32                Features:
    

    Edit: print(type(detail_title)) gives <class 'bs4.element.ResultSet'>, which is not an accepted data type. From https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

    data: ndarray (structured or homogeneous), Iterable, dict, or DataFrame
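    As a minimal offline illustration of that point (the markup below is an invented stand-in for the real page, not the actual remax HTML), converting the ResultSet of Tag objects into a plain list of strings before handing it to pandas is what makes the constructor happy:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Invented stand-in markup mimicking the page's "detail-title" spans
html = '<span class="detail-title">Property Type:</span><span class="detail-title">sqft:</span>'
soup = BeautifulSoup(html, "html.parser")
result_set = soup.find_all(class_="detail-title")  # bs4.element.ResultSet of Tag objects

# Extract the text first so pandas receives a plain list of strings,
# which is an accepted data type for the DataFrame constructor.
texts = [tag.text for tag in result_set]
df = pd.DataFrame(texts)
print(df[0].tolist())  # ['Property Type:', 'sqft:']
```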

    [Discussion]:

      [Solution 2]:

      detail_title doesn't contain anything that can go into a dataframe: it's a list of BeautifulSoup "bs4.element.Tag" objects (see what type(detail_title[0]) gives you). Try the following:

      Step 1. Extract the column headings

      import pandas as pd
      import requests
      from bs4 import BeautifulSoup
      url = 'https://www.remax.ca/ab/calgary-real-estate/720-37-st-nw-wp_id251536557-lst'
      response = requests.get(url)
      soup = BeautifulSoup(response.text, 'html.parser')
      detail_title = soup.find_all(class_='detail-title')
      
      headings = [d.text for d in detail_title]
      details_t = pd.DataFrame(columns = headings)
      

      Step 2. Go one level up in the html and get the detail name/value pairs. (The detail names are the ones you already extracted in Step 1.) Write a helper function to return the value for a given name.

      details = soup.find_all(class_='detail-row ng-star-inserted')
      def get_detail_value(detail_title, details): 
          return [(d.find(class_='detail-value')).text for d in details if (d.find(class_='detail-title')).text == detail_title]
      

      This is a bit odd to do if you're only scraping 1 page. I think what you want is to run Step 1 to get the detail names, and then run Step 2 on all the pages you want to scrape.
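As a hedged sketch, the helper can be exercised offline against a small hand-written snippet; the markup and values here are invented for demonstration, assuming the "detail-row ng-star-inserted" / "detail-title" / "detail-value" class names from the answer match the real page:

```python
from bs4 import BeautifulSoup

# Invented markup mimicking the structure the answer assumes
html = """
<div class="detail-row ng-star-inserted">
  <span class="detail-title">Property Type:</span>
  <span class="detail-value">House</span>
</div>
<div class="detail-row ng-star-inserted">
  <span class="detail-title">Year Built:</span>
  <span class="detail-value">1978</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
details = soup.find_all(class_="detail-row ng-star-inserted")

def get_detail_value(detail_title, details):
    # Return the value text of every row whose title matches detail_title
    return [d.find(class_="detail-value").text
            for d in details
            if d.find(class_="detail-title").text == detail_title]

print(get_detail_value("Year Built:", details))  # ['1978']
```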

      Step 3. For each page you scrape, append the detail values you found to the dataframe.

      details_t = details_t.append({deet:get_detail_value(deet, details) for deet in details_t.columns}, ignore_index = True)
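One side note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the same row-wise accumulation can be done by collecting one dict per page and building the frame once (or with pd.concat). A minimal sketch, with column names and values invented for illustration:

```python
import pandas as pd

# DataFrame.append was deprecated in pandas 1.4 and removed in 2.0.
# Collect one dict per scraped page, then build the frame in one call.
rows = []
rows.append({"Property Type:": "House", "Year Built:": "1978"})  # page 1 (invented values)
rows.append({"Property Type:": "Condo", "Year Built:": "2005"})  # page 2 (invented values)
details_t = pd.DataFrame(rows)
print(details_t.shape)  # (2, 2)
```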
      

      [Discussion]:

      • Thanks, this was really helpful!
      • @SushantDeshpande A friendly tip for you as a new user: Stack Overflow etiquette is to upvote all the answers you found helpful, and to put the green tick next to the answer that comes closest to your question.
      • @butterflyknife Sushant Deshpande hasn't unlocked the voting option yet. I'll do it for you. +1 from me.