How to concat/join columns from multiple csv files into 1 DataFrame()? Answers

【Question title】: How to concat/join columns from multiple csv files into 1 DataFrame()?
【Posted】: 2020-03-30 02:51:08
【Question description】:

The dataset I am using is: https://www.kaggle.com/rohanrao/nifty50-stock-market-data

It contains stock market data for all NIFTY50 companies from 2000 to 2020. Each file contains the following columns: ['Date', 'Symbol', 'Series', 'Prev Close', 'Open', 'High', 'Low', 'Last', 'Close', 'VWAP', 'Volume', 'Turnover', 'Trades', 'Deliverable Volume', '%Deliverble']

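I need to compile the 'Close' column from all the files into a single DataFrame, with Date as the index and the file names as the column names, i.e.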

Date                       ADANIPORTS          ASIANPAINTS       AXISBANK .....
2000-01-01                     0               1500               300
2000-02-02                     1               1600               400
...

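Some files only contain data from a later date (e.g. 01-01-2007). Where a 'Close' value is missing it should be listed as 0, i.e. 0 until the date from which data is available.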

Currently I am using this code.

df=pd.DataFrame()
for filename in filenames:
    file=dir+filename+'.csv'
    data = pd.read_csv(file,usecols=lambda x: x in ['Date', 'Close'])
    data.rename(columns = {'Close':filename}, inplace = True)
    data.set_index('Date',inplace=True)
    df.join(data, how='outer')

This returns a (0, 0) DataFrame, df.

I also tried

#Initialising df with GRASIM.csv, and then using join for the other dataframes
file01 = dir + "GRASIM" + '.csv'
df=pd.read_csv(file01,usecols=lambda x: x in ['Date', 'Close'])
df.rename(columns = {'Close':"GRASIM"}, inplace = True)
df.set_index('Date',inplace = True)

for filename in filenames:
    file=dir+filename+'.csv'
    data = pd.read_csv(file,usecols=lambda x: x in ['Date', 'Close'])
    data.rename(columns = {'Close':filename}, inplace = True)
    data.set_index('Date',inplace=True)
    df.join(data, how='outer')

But this returns the DataFrame as it was originally initialised, i.e.

          GRASIM
Date              
2000-01-03  438.30
2000-01-04  437.15
...            ...

with no other columns added.

What seems to be the problem?

【Question discussion】:

Tags: python database pandas dataframe merge

【Solution 1】:

One way to approach this is to use Python's zipfile module:

import pandas as pd
from zipfile import ZipFile

# collect one DataFrame per file in this list
frames = []
with ZipFile('nifty50-stock-market-data.zip') as myzip:
    # get the list of files in the zip
    for file in myzip.namelist():
        # open each file in the list
        with myzip.open(file) as myfile:
            # read the file with pandas, tag every row with the
            # file name (without extension), and add it to the list;
            # all columns are read in, since some files
            # do not have Date or Close columns
            frames.append(pd.read_csv(myfile)
                          .assign(filename=myfile.name.split('.')[0])
                          )

# concatenate everything and keep the three relevant columns
everything = pd.concat(frames).filter(['Date', 'Close', 'filename'])

everything.head()

        Date     Close  filename
0   2007-11-27  962.90  ADANIPORTS
1   2007-11-28  893.90  ADANIPORTS
2   2007-11-29  884.20  ADANIPORTS
3   2007-11-30  921.55  ADANIPORTS
4   2007-12-03  969.30  ADANIPORTS
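
To get from this long table to the wide layout asked for in the question (dates as the index, one column per file, 0 where there is no data), one option is a pivot. The following is only a sketch on a hand-made stand-in for `everything`: the `XYZ` rows are invented, and it assumes each (Date, filename) pair occurs at most once.

```python
import pandas as pd

# Stand-in for the long `everything` frame; the XYZ rows are invented.
everything = pd.DataFrame({
    "Date": ["2007-11-27", "2007-11-28", "2007-11-28"],
    "Close": [962.90, 893.90, 100.0],
    "filename": ["ADANIPORTS", "ADANIPORTS", "XYZ"],
})

wide = (
    everything
    # rows from files without Date/Close columns carry NaN here; drop them
    .dropna(subset=["Date", "Close"])
    # one row per date, one column per file; absent combinations become NaN
    .pivot(index="Date", columns="filename", values="Close")
    # the question asks for 0 where no data is available
    .fillna(0)
)
print(wide)
```

If a (Date, filename) pair could appear more than once, `pivot` raises an error, and `pivot_table` with an explicit aggregation would be one alternative.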

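【Discussion】:

This works. I also just found that the fix for my own code is df = df.join(data, how='outer'). It comes down to in-place operation: join returns a new DataFrame instead of modifying df.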
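
The point about join can be checked with a small sketch. The two frames below are toy data made up for illustration (the column names `AAA` and `BBB` are placeholders, not symbols from the dataset); the sketch only shows that `DataFrame.join` returns a new frame and leaves the caller unchanged.

```python
import pandas as pd

# Toy data, assumed for illustration only (not taken from the dataset).
left = pd.DataFrame(
    {"AAA": [1.0, 2.0]},
    index=pd.Index(["2000-01-03", "2000-01-04"], name="Date"),
)
right = pd.DataFrame(
    {"BBB": [10.0]},
    index=pd.Index(["2000-01-04"], name="Date"),
)

# join() returns a new DataFrame; the result is discarded here,
# so `left` still has only its original column.
left.join(right, how="outer")
columns_without_assignment = list(left.columns)

# Assigning the result keeps the joined column.
joined = left.join(right, how="outer")
columns_with_assignment = list(joined.columns)

print(columns_without_assignment)
print(columns_with_assignment)
```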

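【Solution 2】:

I'm not clear what output you are looking for. In any case, I'll explain what I did. First, I extracted the files into a Kaggle folder on the C drive, then used os.chdir() to make it the current directory.

In the loop, I read in the data with only the required columns and renamed the Close column to the file name, with the extension removed via os.path.splitext. Next, I appended each frame to a list I had created beforehand. After that, everything was concatenated, and I replaced NaN with zero. I also included a commented-out line: if you change the column name in it, you can inspect any given column.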

import os
import pandas as pd

os.chdir('C:/Kaggle')
data_list = []
for file in os.listdir():
    # read only the Date and Close columns
    data = pd.read_csv(file, usecols=lambda x: x in ['Date', 'Close'])
    # rename Close to the file name without its extension
    data.rename(columns={'Close': os.path.splitext(file)[0]}, inplace=True)
    data_list.append(data)
# stack all frames; columns missing from a given file become NaN
data = pd.concat(data_list, sort=False)
# replace NaN with zero
data = data.fillna(0)
# data = data.loc[data.ASIANPAINT != 0]
data
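
If the goal is the wide table described in the question, an alternative to stacking the frames is to index each one by Date and concatenate along the columns. The sketch below is not the answer's method; it uses two small in-memory CSVs in place of the real files, and the name `csv_texts`, the symbols and the prices are all placeholders.

```python
import io
import pandas as pd

# In-memory stand-ins for two of the CSV files (invented values).
csv_texts = {
    "AAA": "Date,Symbol,Close\n2000-01-03,AAA,10.0\n2000-01-04,AAA,11.0\n",
    "BBB": "Date,Symbol,Close\n2000-01-04,BBB,20.0\n",
}

series_list = []
for name, text in csv_texts.items():
    frame = pd.read_csv(
        io.StringIO(text),
        usecols=["Date", "Close"],
        index_col="Date",
        parse_dates=True,
    )
    # keep only Close, named after the file it came from
    series_list.append(frame["Close"].rename(name))

# axis=1 aligns the series on the Date index (outer join by default);
# dates missing from a file become NaN and are then replaced with 0
wide = pd.concat(series_list, axis=1).sort_index().fillna(0)
print(wide)
```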

【Discussion】: