Concatenate class 'str' array to dataframe: answers

【Question title】: Concatenate class 'str' array to dataframe
【Posted】: 2021-05-15 04:12:45
【Question description】:

I am scraping stock information from onvista.de like this:

import pandas as pd
import requests

hdr={'User-Agent':'Chrome/70.0.3538.110'}

table_dfs={}

for page_number in range(3):
    http= "https://www.onvista.de/aktien/finder/?continent[0]=Europa&continent[1]=Nordamerika&continent[2]=Asien%20-%20Pazifik&PROFIT_PER_SHARE[enabled]=1&PROFIT_KGV[enabled]=1&MARKET_CAPITALIZATION[enabled]=1&PERFORMANCE_6_MONTHS[enabled]=1&PERFORMANCE_4_WEEKS[enabled]=1&SCREENER_INTEREST[enabled]=1&SCREENER_RISK_ZONE[enabled]=1&PROFIT_PER_SHARE[year]=2020&PROFIT_KGV[year]=2020&MARKET_CAPITALIZATION[year]=2020&offset={}".format(page_number*50)

    url= requests.get(http,headers=hdr)
    table_dfs[page_number]= pd.read_html(url.text)

I want to concatenate the results, with their columns, into a single DataFrame. I tried this:

df = pd.concat(table_dfs)

But this results in an error:

TypeError: cannot concatenate object of type "<class 'list'>"; 
only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid

The output of table_dfs[0] looks like this:

[       WKN                          Wert                  Branche  \
 0   A2PSR2           BIONTECH SE SP.ADRS           Biotechnologie   
 1   A1JA81               PLUG POWER INC.       Elektrotechnologie   
 2   A0B733                           Nel  Sonstige Energie / R...   
 
               Land Gewinn pro Aktie (€)     KGV  \
 0      Deutschland          Deutschland     NaN   
 1              USA                  USA   -26.0   
 2         Norwegen             Norwegen     0.0   
 
     Marktkapitalisierung (Mio. €) Performance - 6M (%)  Performance - 4W (%)  \
 0                             NaN                  000            6139.00000   
 1                             NaN             12.76665           43430.00000   
 2                             NaN              3.97097            8434.00000   
 
     Chance-Rating (the Screener)  Risiko-Rating (the Screener)  Unnamed: 11  
 0                           1888                           NaN          NaN  
 1                           1962                           4.0          0.0  
 2                           -705                           1.0          0.0 ]

My goal is to get this data into a CSV file (all rows combined).

Thanks for your help.

【Question discussion】:

Either use * to unpack it, as in concat(*table_dfs), or use a for-loop to add each item separately: for table in table_dfs: df = df.concat(table). You should check the documentation for concat().

Tags: python dataframe web-scraping concatenation

【Solution 1】:

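read_html() searches the page for every <table>, converts each <table> to a DataFrame, and returns a list of all the DataFrames. It returns a list even if it finds only a single <table> on the page (if it finds no table at all, it raises ValueError), so you have to use [0] to get the first DataFrame from the list.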

pd.read_html(response.text)[0]
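
To see why the [0] matters, here is a minimal sketch that reproduces the TypeError from the question without any network access. The variable page_result is a hypothetical stand-in for what read_html() returns for one page (a list holding one DataFrame), and the two rows are copied from the question's output.

```python
import pandas as pd

# Hypothetical stand-in for the return value of pd.read_html() on one page:
# a list that holds a single DataFrame.
page_result = [pd.DataFrame({"WKN": ["A2PSR2", "A1JA81"], "KGV": [None, -26.0]})]

# Passing the lists themselves to concat() reproduces the error from the question.
try:
    pd.concat([page_result, page_result])
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Taking element [0] of each list yields DataFrames, which concat() accepts.
combined = pd.concat([page_result[0], page_result[0]])

print(error_message)
print(combined.shape)
```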

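I see another problem: you keep the items in a dictionary. When you call concat(table_dfs) on a dictionary, pandas concatenates the values and uses the keys as an extra outer index level, which you don't need here. Use table_dfs.values(), or use a list instead of a dictionary.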
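
As a rough illustration of the dictionary-versus-list choice, the sketch below concatenates the same two small DataFrames both ways. The name pages and its contents are made up for the example. With a dictionary, pandas adds the keys as an outer index level; with the values alone, the result keeps only the original row index.

```python
import pandas as pd

# Hypothetical per-page results, keyed by page number as in the question.
pages = {
    0: pd.DataFrame({"WKN": ["A2PSR2", "A1JA81"]}),
    1: pd.DataFrame({"WKN": ["A0B733"]}),
}

# Passing the dict: the keys become an extra outer level of the index.
from_dict = pd.concat(pages)

# Passing only the values (as a list): a plain, single-level index.
from_values = pd.concat(list(pages.values()))

print(from_dict.index.nlevels)
print(from_values.index.tolist())
```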

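Another problem: all the DataFrames have the same index, so after concatenation index 0 appears three times, index 1 appears three times, and so on, like [0..49, 0..49, 0..49]. Use concat(..., ignore_index=True) to get the index [0..149].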
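
A small sketch of the index behaviour, using three made-up frames of three rows each in place of the real 50-row pages:

```python
import pandas as pd

# Three hypothetical "pages", each with its own 0..2 index.
frames = [
    pd.DataFrame({"WKN": [f"page{page}_row{row}" for row in range(3)]})
    for page in range(3)
]

# Default: each frame keeps its own index, so the labels repeat.
repeated = pd.concat(frames)

# ignore_index=True: the result gets a fresh 0..n-1 index.
renumbered = pd.concat(frames, ignore_index=True)

print(repeated.index.tolist())
print(renumbered.index.tolist())
```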

My working code.

I changed some names and reformatted the code (see: PEP 8 -- Style Guide for Python Code)

from io import StringIO

import pandas as pd
import requests

headers = {'User-Agent': 'Chrome/70.0.3538.110'}  # PEP8: spaces around `=`

# Query parameters; requests URL-encodes them when building the final URL
payload = {
    'continent[0]': 'Europa',
    'continent[1]': 'Nordamerika',
    'continent[2]': 'Asien - Pazifik',
    'PROFIT_PER_SHARE[enabled]': 1,
    'PROFIT_KGV[enabled]': 1,
    'MARKET_CAPITALIZATION[enabled]': 1,
    'PERFORMANCE_6_MONTHS[enabled]': 1,
    'PERFORMANCE_4_WEEKS[enabled]': 1,
    'SCREENER_INTEREST[enabled]': 1,
    'SCREENER_RISK_ZONE[enabled]': 1,
    'PROFIT_PER_SHARE[year]': 2020,
    'PROFIT_KGV[year]': 2020,
    'MARKET_CAPITALIZATION[year]': 2020,
    'offset': 0,
}

url = 'https://www.onvista.de/aktien/finder/'

table_dfs = []  # list

# Three pages of 50 rows each: offsets 0, 50, 100
for offset in range(0, 3*50, 50):

    payload['offset'] = offset

    response = requests.get(url, params=payload, headers=headers)

    # StringIO wrapper: recent pandas versions deprecate passing literal HTML as a string
    all_tables = pd.read_html(StringIO(response.text))
    table_dfs.append(all_tables[0])  # `[0]` - first DataFrame from list

    #print(type(all_tables), all_tables)
    #print(type(all_tables[0]), all_tables[0])

df = pd.concat(table_dfs, ignore_index=True)

print(len(df))
print(df)
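
The question's stated goal was a CSV file, which the code above stops short of. A minimal sketch of that last step follows, assuming df is the combined DataFrame; here it is replaced by a tiny made-up frame, and the file name onvista_stocks.csv and the temp-directory location are hypothetical choices for the example.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for the combined DataFrame produced by pd.concat().
df = pd.DataFrame({
    "WKN": ["A2PSR2", "A1JA81", "A0B733"],
    "Land": ["Deutschland", "USA", "Norwegen"],
})

# Hypothetical output path; any writable location works.
csv_path = os.path.join(tempfile.gettempdir(), "onvista_stocks.csv")

# index=False leaves the 0..n-1 row index out of the file.
df.to_csv(csv_path, index=False)

# Read the file back to check what was written.
round_trip = pd.read_csv(csv_path)
print(round_trip)
```

Since the index was already reset with ignore_index=True, it carries no information, which is why index=False seems like the natural choice here.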

【Discussion】:

Beautiful. Thanks. I will read the style guide ;-)