【Question Title】: Concat multiple CSVs with the same column name
【Posted】: 2021-12-30 02:22:47
【Question】:

I'm having trouble concatenating these pandas DataFrames because I keep getting the error pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects. I'm also trying to make my code less clunky so it runs more smoothly, and I'd like to know whether there is a way to fetch multiple pages into one CSV with Python. Any help would be great.
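For context, this error is easy to reproduce in isolation: pd.concat fails as soon as one of the frames being combined has a duplicate column label, which is exactly what happens when a data row (rather than the real header) gets used as column names. A minimal sketch with toy frames, not the scraped data:

```python
import pandas as pd

# df_a has a duplicate column label "x"; df_b's columns differ, so concat
# must re-align the columns and hits the non-unique index.
df_a = pd.DataFrame([[1, 2]], columns=["x", "x"])   # duplicate label
df_b = pd.DataFrame([[3, 4]], columns=["x", "y"])

raised = None
try:
    pd.concat([df_a, df_b], axis=0)
except Exception as err:   # the "Reindexing only valid ..." error
    raised = err

print(type(raised).__name__)
```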

import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}

URL = "https://www.collincad.org/propertysearch?situs_street=Willowgate&situs_street_suffix" \
      "=&isd%5B%5D=any&city%5B%5D=any&prop_type%5B%5D=R&prop_type%5B%5D=P&prop_type%5B%5D=MH&active%5B%5D=1&year=2021&sort=G&page_number=1"

t = URL + "&page_number="
URL2 = t + "2"
URL3 = t + "3"

s = requests.Session()

data = []

page = s.get(URL,headers=headers)
page2 = s.get(URL2, headers=headers)
page3 = s.get(URL3, headers=headers)

soup = BeautifulSoup(page.content, "lxml")
soup2 = BeautifulSoup(page2.content, "lxml")
soup3 = BeautifulSoup(page3.content, "lxml")


for row in soup.select('#propertysearchresults tr'):
    data.append([c.get_text(' ',strip=True) for c in row.select('td')])
for row in soup2.select('#propertysearchresults tr'):
    data.append([c.get_text(' ',strip=True) for c in row.select('td')])
for row in soup3.select('#propertysearchresults tr'):
    data.append([c.get_text(' ',strip=True) for c in row.select('td')])


df1 = pd.DataFrame(data[1:], columns=data[0])
df2 = pd.DataFrame(data[2:], columns=data[1])
df3 = pd.DataFrame(data[3:], columns=data[2])

final = pd.concat([df1, df2, df3], axis=0)

final.to_csv('Street.csv', encoding='utf-8')

【Question comments】:

  • Your data variable already contains the table cells from all three pages, right? So... it is already "concatenated", right? I think the only thing you need to do is remove any headers that got appended along with the real data from the page-2 and page-3 tables, or be more selective in your td iterator and make sure to skip the first row on pages 2 & 3.
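The commenter's suggestion can be sketched with a toy stand-in for the scraped data list (the rows below are illustrative, not real scrape output):

```python
import pandas as pd

# Toy version of `data`: a header row followed by data rows, with the header
# repeated where page 2 began (this mirrors the scrape across pages).
data = [
    ["ID", "Owner"],
    ["1", "Alice"],
    ["ID", "Owner"],      # header repeated at the start of page 2
    ["2", "Bob"],
]

header = data[0]
rows = [r for r in data[1:] if r != header]   # drop repeated header rows
df = pd.DataFrame(rows, columns=header)
print(df)
```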

Tags: python-3.x pandas dataframe concatenation export-to-csv


【Solution 1】:

What is happening?

As @Zach Young pointed out, data already holds all the rows you want to turn into one DataFrame. So this is not a pandas problem, but a question of how the information is collected.

How to fix it?

One approach, based on the code in your question, is to select the table data more specifically. Note the tbody in the selector, which excludes the header row:

for row in soup.select('#propertysearchresults tbody tr'):
    data.append([c.get_text(' ',strip=True) for c in row.select('td')])

When creating the DataFrame, you can then set the column headers separately:

pd.DataFrame(data, columns=[c.get_text(' ',strip=True) for c in soup.select('#propertysearchresults thead td')])

Example

This shows how to iterate over the different pages of the website that contain your table:

import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}

URL = "https://www.collincad.org/propertysearch?situs_street=Willowgate&situs_street_suffix" \
      "=&isd%5B%5D=any&city%5B%5D=any&prop_type%5B%5D=R&prop_type%5B%5D=P&prop_type%5B%5D=MH&active%5B%5D=1&year=2021&sort=G&page_number=1"

s = requests.Session()

data = []
while True:

    page = s.get(URL,headers=headers)
    soup = BeautifulSoup(page.content, "lxml")

    for row in soup.select('#propertysearchresults tbody tr'):
        data.append([c.get_text(' ',strip=True) for c in row.select('td')])

    if (a := soup.select_one('#page_selector strong + a')):
        URL = "https://www.collincad.org"+a['href']
    else:
        break


pd.DataFrame(data, columns=[c.get_text(' ',strip=True) for c in soup.select('#propertysearchresults thead td')])

Output

|     | Property ID ↓ | Geographic ID ↓ | Owner Name | Property Address | Legal Description | 2021 Market Value |
|-----|---------------|-----------------|------------|------------------|-------------------|-------------------|
| 1   | 2709013 | R-10644-00H-0010-1 | PARTHASARATHY SURESH & ANITHA HARIKRISHNAN | 12209 Willowgate Dr Frisco, TX 75035 | Ridgeview At Panther Creek Phase 2, Blk H, Lot 1 | $513,019 |
| ... | ... | ... | ... | ... | ... |
| 61  | 2129238 | R-4734-00C-0110-1 | HEPFER ARRON | 990 Willowgate Dr Prosper, TX 75078 | Willow Ridge Phase One, Blk C, Lot 11 | $509,795 |

【Comments】:

    【Solution 2】:

    Usually one would iterate over the page numbers and concatenate a list of DataFrames, but since you only have three pages, your code is fine as is.

    Because every for row in ... loop appends to the same data list, your final DataFrame is already df1; you only need to drop the rows that repeat the column names.

    final = df1[df1['Property ID ↓ Geographic ID ↓']!='Property ID ↓ Geographic ID ↓']
    
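    The "iterate over page numbers and concatenate a list of DataFrames" pattern mentioned above can be sketched like this (parse_page is a hypothetical stand-in for the requests + BeautifulSoup parsing of one results page):

```python
import pandas as pd

def parse_page(page_number):
    # Hypothetical stand-in: the real script would fetch
    # URL + "&page_number=" + str(page_number) and parse the table rows.
    return pd.DataFrame({"Property ID": [str(page_number)],
                         "Owner Name": ["EXAMPLE"]})

frames = [parse_page(n) for n in range(1, 4)]         # pages 1..3
final = pd.concat(frames, axis=0, ignore_index=True)  # one clean index
```

    Building the list first and calling pd.concat once keeps the columns aligned by name and avoids the repeated-header problem entirely.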

    【Comments】:

      【Solution 3】:

      Instead of your last few lines of code:

      df1 = pd.DataFrame(data[1:], columns=data[0])
      df2 = pd.DataFrame(data[2:], columns=data[1])
      df3 = pd.DataFrame(data[3:], columns=data[2])
      
      final = pd.concat([df1, df2, df3], axis=0)
      
      final.to_csv('Street.csv', encoding='utf-8')
      

      you can use this (avoiding the split into separate DataFrames and the concatenation):

      final = pd.DataFrame(data[1:], columns=data[0])   # Sets the first row as the column names
      final = final.iloc[:,1:]   # Gets rid of the additional index column
      

      【Comments】:
