Reading multiple files in a folder and creating a pandas DataFrame: answers

【Question title】: Reading multiple files in a folder and creating a pandas dataframe
【Posted】: 2018-01-23 14:56:47
【Question】:

I am reading large pickle files into pandas dataframes. I loaded one of them and it loads the way I need. However, I have a folder with 40 pickle files, named imdbnames0.pkl, imdbnames1.pkl, imdbnames2.pkl, ...., imdbnames40.pkl.

I would like to load all of them in a way similar to the code below and combine them into one single pandas dataframe.

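import pickle
import pandas as pd
from pandas import json_normalize  # pandas < 1.0: from pandas.io.json import json_normalize

with open("ethnicity_files/imdbnames1.pkl", 'rb') as fh:
    d = pickle.load(fh)
df = pd.concat({k: json_normalize(v, 'scores', ['best']) for k, v in d.items()})
df = df.reset_index(level=1, drop=True).rename_axis('names').reset_index()
df.head()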



    names         ethnicity        score  best
0   !Gubi Tietie  Asian            0.03   GreaterEuropean
1   !Gubi Tietie  GreaterAfrican   0.01   GreaterEuropean
2   !Gubi Tietie  GreaterEuropean  0.96   GreaterEuropean
3   !Gubi Tietie  British          0.17   WestEuropean
4   !Gubi Tietie  Jewish           0.13   WestEuropean
5   !Gubi Tietie  WestEuropean     0.65   WestEuropean
6   !Gubi Tietie  EastEuropean     0.05   WestEuropean
7   !Gubi Tietie  Nordic           0.00   Italian
8   !Gubi Tietie  Italian          0.69   Italian
9   !Gubi Tietie  Hispanic         0.12   Italian
10  !Gubi Tietie  French           0.16   Italian
11  !Gubi Tietie  Germanic         0.02   Italian
12  $2 Tony       Asian            0.00   GreaterEuropean
13  $2 Tony       GreaterAfrican   0.00   GreaterEuropean
14  $2 Tony       GreaterEuropean  1.00   GreaterEuropean
15  $2 Tony       British          0.00   WestEuropean
16  $2 Tony       Jewish           0.00   WestEuropean
17  $2 Tony       WestEuropean     1.00   WestEuropean
18  $2 Tony       EastEuropean     0.00   WestEuropean
19  $2 Tony       Nordic           0.00   Italian

One of the files is available here: https://drive.google.com/file/d/10cjsoWFJ46w-2lEsxh6hmuRZlLunatf-/view?usp=sharing

I just want to add them all to a single pandas dataframe.

【Comments】:

Tags: python python-3.x pandas pickle

【Solution 1】:

I think you need os.listdir():

# Be careful: this may raise a MemoryError if you don't have enough RAM
# for all your files. Also make sure the folder contains only the files
# you want to read.
import os
import pickle
import pandas as pd
from pandas import json_normalize  # pandas < 1.0: from pandas.io.json import json_normalize

files = os.listdir('ethnicity_files/')

list_of_dfs = []
for file in files:
    # pickle.load needs an open binary file object, not a path string
    with open(os.path.join('ethnicity_files/', file), 'rb') as fh:
        d = pickle.load(fh)
    df = pd.concat({k: json_normalize(v, 'scores', ['best']) for k, v in d.items()})
    df = df.reset_index(level=1, drop=True).rename_axis('names').reset_index()
    list_of_dfs.append(df)
# ignore_index=True resets the index of big_df
big_df = pd.concat(list_of_dfs, ignore_index=True)
big_df.head()
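
The warning about stray files in the folder, and the fact that os.listdir() returns names in arbitrary order, can both be handled with a small filter and a numeric sort. The following is only a sketch: the helper names file_number, load_one and load_folder and the prefix parameter are hypothetical, and it assumes each pickle holds a dict mapping a name to a list of {'scores': [...], 'best': ...} records, which is what the question's output suggests but is not confirmed.

```python
import os
import pickle
import re

import pandas as pd

try:
    from pandas import json_normalize  # pandas >= 1.0
except ImportError:
    from pandas.io.json import json_normalize  # older pandas


def file_number(filename):
    """Return the integer just before '.pkl' (imdbnames12.pkl -> 12), or -1."""
    match = re.search(r"(\d+)\.pkl$", filename)
    return int(match.group(1)) if match else -1


def load_one(path):
    """Load one pickle and flatten it the same way as the question's snippet."""
    with open(path, "rb") as fh:
        d = pickle.load(fh)
    df = pd.concat({k: json_normalize(v, "scores", ["best"]) for k, v in d.items()})
    return df.reset_index(level=1, drop=True).rename_axis("names").reset_index()


def load_folder(folder, prefix="imdbnames"):
    """Concatenate every <prefix>*.pkl file in `folder`, in numeric order."""
    # Keep only the files that look like the ones we want; ignore anything else.
    names = [f for f in os.listdir(folder)
             if f.startswith(prefix) and f.endswith(".pkl")]
    if not names:
        raise FileNotFoundError("no %s*.pkl files found in %s" % (prefix, folder))
    names.sort(key=file_number)  # imdbnames2.pkl comes before imdbnames10.pkl
    frames = [load_one(os.path.join(folder, f)) for f in names]
    return pd.concat(frames, ignore_index=True)
```

With the question's layout this would be called as big_df = load_folder('ethnicity_files/'). All frames are still held in memory at once, so the RAM caveat above applies unchanged.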

【Comments】:

【Solution 2】:

You can use glob.glob to iterate over all files in the current folder that have a specific extension (.pkl in your case):

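import glob
import os
import pickle
import pandas as pd
from pandas import json_normalize  # pandas < 1.0: from pandas.io.json import json_normalize

cd = os.getcwd()
os.chdir('path_to_your_folder')

list_of_dfs = []  # collect each file's frame; otherwise df is overwritten on every pass
for file in glob.glob("*.pkl"):
    with open(file, 'rb') as fh:
        d = pickle.load(fh)
    df = pd.concat({k: json_normalize(v, 'scores', ['best']) for k, v in d.items()})
    df = df.reset_index(level=1, drop=True).rename_axis('names').reset_index()
    list_of_dfs.append(df)
os.chdir(cd)  # restore the original working directory
big_df = pd.concat(list_of_dfs, ignore_index=True)
print(big_df.head())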
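
Changing the working directory is not strictly needed for this: glob.glob accepts a pattern that includes the folder, so the process's current directory can stay untouched. A minimal sketch, where list_pickles is a hypothetical helper name:

```python
import glob
import os


def list_pickles(folder):
    """Return the sorted paths of the .pkl files directly inside `folder`."""
    # Putting the folder into the pattern means os.chdir() is never called.
    return sorted(glob.glob(os.path.join(folder, "*.pkl")))
```

The returned paths already include the folder, so each one can be passed straight to open(path, 'rb'). Note that sorted() orders the names as text, so imdbnames10.pkl would come before imdbnames2.pkl; that only matters if the row order of the combined dataframe matters.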

【Comments】: