[Question Title]: Python web scraping and saving to a pandas dataframe
[Posted]: 2020-02-02 21:28:06
[Question Description]:

I'm trying to scrape the housing listings on a remax page and save that information to a pandas dataframe, but for some reason it keeps giving me a KeyError. Here's my code:

import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'https://www.remax.ca/ab/calgary-real-estate/720-37-st-nw-wp_id251536557-lst'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
detail_title = soup.find_all(class_='detail-title')
details_t = pd.DataFrame(detail_title)

Here's the error I get:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-6-3be49b8e4cfc> in <module>
      6 soup = BeautifulSoup(response.text, 'html.parser')
      7 detail_title = soup.find_all(class_='detail-title')
----> 8 details_t = pd.DataFrame(detail_title)

~/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    449                 else:
    450                     mgr = init_ndarray(data, index, columns, dtype=dtype,
--> 451                                        copy=copy)
    452             else:
    453                 mgr = init_dict({}, index, columns, dtype=dtype)

~/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py in init_ndarray(values, index, columns, dtype, copy)
    144     # by definition an array here
    145     # the dtypes will be coerced to a single dtype
--> 146     values = prep_ndarray(values, copy=copy)
    147 
    148     if dtype is not None:

~/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py in prep_ndarray(values, copy)
    228         try:
    229             if is_list_like(values[0]) or hasattr(values[0], 'len'):
--> 230                 values = np.array([convert(v) for v in values])
    231             elif isinstance(values[0], np.ndarray) and values[0].ndim == 0:
    232                 # GH#21861

~/anaconda3/lib/python3.7/site-packages/bs4/element.py in __getitem__(self, key)
   1014         """tag[key] returns the value of the 'key' attribute for the tag,
   1015         and throws an exception if it's not there."""
-> 1016         return self.attrs[key]
   1017 
   1018     def __iter__(self):

KeyError: 0

Any help would be appreciated!

[Question Discussion]:

    Tags: python pandas dataframe web-scraping beautifulsoup


    [Solution 1]:

    You can try this. I'm assuming you only want the text inside the <span> tags, but feel free to adapt my working example.

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    url = 'https://www.remax.ca/ab/calgary-real-estate/720-37-st-nw-wp_id251536557-lst'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    detail_title = soup.find_all(class_='detail-title')
    
    ls = []
    
    for tag in detail_title:
        ls.append(tag.text)
    
    df = pd.DataFrame(data=ls)
    
    print(df)
    

    输出

                               0
    0            Property Type:
    1             Property Tax:
    2             Last Updated:
    3        Property Sub Type:
    4                  MLS® #:
    5           Ownership-Type:
    6               Year Built:
    7                     sqft:
    8              Date Listed:
    9                 Lot Size:
    10               Occupancy:
    11             Subdivision:
    12                 Heating:
    13          Heating Source:
    14          Full Bathrooms:
    15          Half Bathrooms:
    16                   Rooms:
    17                Basement:
    18    Basement Development:
    19                Flooring:
    20          Parking Spaces:
    21                 Parking:
    22                    Area:
    23                Exterior:
    24              Foundation:
    25                    Roof:
    26                   Faces:
    27  Miscellaneous Features:
    28         Lot Description:
    29                   Condo:
    30                Board ID:
    31                   Suite:
    32                Features:
    

    Edit: print(type(detail_title)) gives <class 'bs4.element.ResultSet'>, which is not an accepted data type. From https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

    data: ndarray (structured or homogeneous), Iterable, dict, or DataFrame
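    As a minimal offline illustration of that point (the markup below is an invented stand-in for the real page, not the actual remax HTML), converting the ResultSet of Tag objects into a plain list of strings before handing it to pandas is what makes the constructor happy:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Invented stand-in markup mimicking the page's "detail-title" spans
html = '<span class="detail-title">Property Type:</span><span class="detail-title">sqft:</span>'
soup = BeautifulSoup(html, "html.parser")
result_set = soup.find_all(class_="detail-title")  # bs4.element.ResultSet of Tag objects

# Extract the text first so pandas receives a plain list of strings,
# which is an accepted data type for the DataFrame constructor.
texts = [tag.text for tag in result_set]
df = pd.DataFrame(texts)
print(df[0].tolist())  # ['Property Type:', 'sqft:']
```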

    [Discussion]:

      [Solution 2]:

      detail_title doesn't contain anything that can go into a dataframe: it's a list of BeautifulSoup "bs4.element.Tag" objects (see what type(detail_title[0]) gives you). Try the following:

      Step 1. Extract the column headings

      import pandas as pd
      import requests
      from bs4 import BeautifulSoup
      url = 'https://www.remax.ca/ab/calgary-real-estate/720-37-st-nw-wp_id251536557-lst'
      response = requests.get(url)
      soup = BeautifulSoup(response.text, 'html.parser')
      detail_title = soup.find_all(class_='detail-title')
      
      headings = [d.text for d in detail_title]
      details_t = pd.DataFrame(columns = headings)
      

      Step 2. Go one level up in the html and get the detail name/value pairs. (The detail names are the ones you already extracted in Step 1.) Write a helper function to return the value for a given name.

      details = soup.find_all(class_='detail-row ng-star-inserted')
      def get_detail_value(detail_title, details): 
          return [(d.find(class_='detail-value')).text for d in details if (d.find(class_='detail-title')).text == detail_title]
      

      This is a bit odd to do if you're only scraping 1 page. I think what you want is to run Step 1 to get the detail names, and then run Step 2 on all the pages you want to scrape.
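As a hedged sketch, the helper can be exercised offline against a small hand-written snippet; the markup and values here are invented for demonstration, assuming the "detail-row ng-star-inserted" / "detail-title" / "detail-value" class names from the answer match the real page:

```python
from bs4 import BeautifulSoup

# Invented markup mimicking the structure the answer assumes
html = """
<div class="detail-row ng-star-inserted">
  <span class="detail-title">Property Type:</span>
  <span class="detail-value">House</span>
</div>
<div class="detail-row ng-star-inserted">
  <span class="detail-title">Year Built:</span>
  <span class="detail-value">1978</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
details = soup.find_all(class_="detail-row ng-star-inserted")

def get_detail_value(detail_title, details):
    # Return the value text of every row whose title matches detail_title
    return [d.find(class_="detail-value").text
            for d in details
            if d.find(class_="detail-title").text == detail_title]

print(get_detail_value("Year Built:", details))  # ['1978']
```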

      Step 3. For each page you scrape, append the detail values you found to the dataframe.

      details_t = details_t.append({deet:get_detail_value(deet, details) for deet in details_t.columns}, ignore_index = True)
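One side note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the same row-wise accumulation can be done by collecting one dict per page and building the frame once (or with pd.concat). A minimal sketch, with column names and values invented for illustration:

```python
import pandas as pd

# DataFrame.append was deprecated in pandas 1.4 and removed in 2.0.
# Collect one dict per scraped page, then build the frame in one call.
rows = []
rows.append({"Property Type:": "House", "Year Built:": "1978"})  # page 1 (invented values)
rows.append({"Property Type:": "Condo", "Year Built:": "2005"})  # page 2 (invented values)
details_t = pd.DataFrame(rows)
print(details_t.shape)  # (2, 2)
```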
      

      [Discussion]:

      • Thanks, this was really helpful!
      • @SushantDeshpande A friendly tip for you as a new user: Stack Overflow etiquette is to upvote all the answers you found helpful, and to put the green tick next to the answer that comes closest to your question.
      • @butterflyknife Sushant Deshpande hasn't unlocked the voting option yet. I'll do it for you. +1 from me.