[Posted]: 2021-12-30 02:22:47
[Question]:
I'm having trouble concatenating these pandas DataFrames because I keep getting the error pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects. I'm also trying to make my code less clunky so it runs more smoothly, and I'd like to know whether there's a way to get multiple pages into one CSV using Python. Any help would be great.
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
URL = "https://www.collincad.org/propertysearch?situs_street=Willowgate&situs_street_suffix" \
"=&isd%5B%5D=any&city%5B%5D=any&prop_type%5B%5D=R&prop_type%5B%5D=P&prop_type%5B%5D=MH&active%5B%5D=1&year=2021&sort=G&page_number=1"
t = URL + "&page_number="
URL2 = t + "2"
URL3 = t + "3"
s = requests.Session()
data = []
page = s.get(URL,headers=headers)
page2 = s.get(URL2, headers=headers)
page3 = s.get(URL3, headers=headers)
soup = BeautifulSoup(page.content, "lxml")
soup2 = BeautifulSoup(page2.content, "lxml")
soup3 = BeautifulSoup(page3.content, "lxml")
for row in soup.select('#propertysearchresults tr'):
data.append([c.get_text(' ',strip=True) for c in row.select('td')])
for row in soup2.select('#propertysearchresults tr'):
data.append([c.get_text(' ',strip=True) for c in row.select('td')])
for row in soup3.select('#propertysearchresults tr'):
data.append([c.get_text(' ',strip=True) for c in row.select('td')])
df1 = pd.DataFrame(data[1:], columns=data[0])
df2 = pd.DataFrame(data[2:], columns=data[1])
df3 = pd.DataFrame(data[3:], columns=data[2])
final = pd.concat([df1, df2, df3], axis=0)
final.to_csv('Street.csv', encoding='utf-8')
[Discussion]:
- Your data variable already contains the table cells from all three pages, right? So... it's already "concatenated", right? I think the only thing you need to do is strip out any header rows from the page 2 and page 3 tables that got appended along with the real data, or be more selective in your td iterator and make sure to skip the first row of pages 2 & 3.
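The comment above can be sketched as code. This is a minimal, hedged sketch (not the asker's final solution), assuming the first row of the results table supplies the column names and that pages 2 and 3 repeat that same header row; the `rows_to_frame` helper name is made up for illustration. It loops over page numbers instead of duplicating the request/soup code three times, builds one DataFrame, and drops the repeated headers so no `pd.concat` is needed:

```python
import pandas as pd

BASE = ("https://www.collincad.org/propertysearch?situs_street=Willowgate"
        "&situs_street_suffix=&isd%5B%5D=any&city%5B%5D=any&prop_type%5B%5D=R"
        "&prop_type%5B%5D=P&prop_type%5B%5D=MH&active%5B%5D=1&year=2021&sort=G")

def rows_to_frame(rows):
    """Treat the first row as the header; drop any later row that is
    identical to it (the per-page header repeats on pages 2 and 3)."""
    header, body = rows[0], rows[1:]
    return pd.DataFrame([r for r in body if r != header], columns=header)

def scrape(pages=3):
    # requests/bs4 are imported lazily so rows_to_frame can be used offline.
    import requests
    from bs4 import BeautifulSoup

    session = requests.Session()
    headers = {"user-agent": "Mozilla/5.0"}
    rows = []
    for n in range(1, pages + 1):
        page = session.get(f"{BASE}&page_number={n}", headers=headers)
        soup = BeautifulSoup(page.content, "lxml")
        for tr in soup.select("#propertysearchresults tr"):
            rows.append([td.get_text(" ", strip=True) for td in tr.select("td")])
    return rows_to_frame(rows)

# Usage (hits the live site):
# scrape(pages=3).to_csv("Street.csv", index=False, encoding="utf-8")
```

Because all pages land in one list, a single DataFrame covers every page and the duplicate-index concat error never arises; `index=False` also keeps the row index out of the CSV.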
Tags: python-3.x pandas dataframe concatenation export-to-csv