Append Excel spreadsheets using Pandas: answers

【Question title】: Append excel spreadsheets using Pandas
【Posted】: 2019-08-11 14:48:20
【Question description】:

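I have the following data set in a folder:

a) 10 Excel spreadsheets (with different names)

b) Each spreadsheet has 7 tabs. Of the 7 tabs in each spreadsheet, 2 have exactly the same name in every file, whereas the remaining 5 have different sheet names.

c) I need to concatenate the five differently named sheets from each of the 10 spreadsheets.

d) In all, 10*5 sheets need to be concatenated.

How can I concatenate all 50 sheets so that the final output is one "master" spreadsheet with all 50 sheets appended (without concatenating the two sheets that have exactly the same name in each Excel file)?

I am using the following code in a Jupyter notebook to concatenate the sheets, but it doesn't help: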

import pandas as pd

xlsx = pd.ExcelFile('A://Data/File.xlsx')
data_sheets = []
for sheet in xlsx.sheet_names:
    data_sheets.append(xlsx.parse(sheet))
data = pd.concat(data_sheets)
print(data)

Thanks for reading.

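【Comments】:

Do all the sheets have the same data structure?
@dubbbdan Yes, all five (the ones with different names) have exactly the same data structure, whereas the other two (which have the same names in every Excel file) have a completely different data structure. I don't care about the two with the same names. I need the data from the 5.
Are they always in the same order? How do you know which sheets you want (given the repeated names)?
@dubbbdan, for example: say the first spreadsheet has the following sheet names: ['A','B',1,2,3,4,5] and the second spreadsheet has the following sheet names: ['A','B',9,10,11,12,13]. The common ones are sheets 'A' and 'B' (I don't need these), whereas all the rest need to be appended underneath one another.

Tags: python python-3.x pandas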

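【Solution 1】:

IIUC, you need to read all the sheets from the 10 workbooks and append each dataframe to a list, data_sheets. One way to do this is to create a second list, names_to_find, and append each sheet name to it as you iterate, so that the two lists stay aligned by position.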

names_to_find = []
data_sheets = []
for excelfile in excelfile_list:
    xlsx = pd.ExcelFile(excelfile)

    for sheet in xlsx.sheet_names:
        data_sheets.append(xlsx.parse(sheet))
        names_to_find.append(sheet)

Once all the data has been read in, you can use names_to_find and np.unique to find the unique sheet names and how often each one occurs.

#find unique elements and return counts
unique, counts = np.unique(names_to_find,return_counts=True)

#find unique sheet names with a frequency of one
unique_set = unique[counts==1]
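As a small illustration of this step, the sketch below runs np.unique on a made-up list of sheet names modeled on the example from the question comments (two workbooks that share sheets 'A' and 'B'). The sample names are assumptions, written as strings on the assumption that this is how the sheet names are read in.

```python
import numpy as np

# Hypothetical sheet names collected from two workbooks
names_to_find = ['A', 'B', '1', '2', '3', 'A', 'B', '9', '10', '11']

# Unique names plus the number of times each one occurs
unique, counts = np.unique(names_to_find, return_counts=True)

# Names that occur exactly once belong to a single workbook
unique_set = unique[counts == 1]
print(unique_set)
```

Here 'A' and 'B' each occur twice and are dropped, while the six remaining names are kept.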

You can then use np.argwhere to find the indices in names_to_find at which the names in unique_set occur:

#find the indices where the unique sheet names exist 
idx_to_select = np.argwhere(np.isin(names_to_find, unique_set)).flatten()
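To make the indexing step concrete, here is a minimal sketch on made-up inputs. Both names_to_find and unique_set below are hypothetical stand-ins for the values built in the earlier steps.

```python
import numpy as np

# Hypothetical stand-ins for the lists built in the earlier steps
names_to_find = ['A', 'B', '1', '2', 'A', 'B', '9', '10']
unique_set = np.array(['1', '2', '9', '10'])

# Boolean mask: True where the sheet name is one of the unique names
mask = np.isin(names_to_find, unique_set)

# argwhere returns a column of positions; flatten makes it one-dimensional
idx_to_select = np.argwhere(mask).flatten()
print(idx_to_select)  # positions of the sheets to keep
```

Because data_sheets and names_to_find are filled in the same loop, these positions can be used directly to pick the matching dataframes out of data_sheets.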

Finally, with a little list comprehension, you can subset data_sheets so that it contains only the data of interest:

#use list comprehension to subset data_sheets 
data_sheets = [data_sheets[i] for i in idx_to_select]
data = pd.concat(data_sheets)

All together:

import glob

import numpy as np
import pandas as pd

# list of workbooks to process; 'some_directory' is a placeholder path
excelfile_list = glob.glob('some_directory//*.xlsx')

names_to_find = []
data_sheets = []
for excelfile in excelfile_list:
    xlsx = pd.ExcelFile(excelfile)

    for sheet in xlsx.sheet_names:
        data_sheets.append(xlsx.parse(sheet))
        names_to_find.append(sheet)

# find unique elements and return counts
unique, counts = np.unique(names_to_find, return_counts=True)

# find unique sheet names with a frequency of 1
unique_set = unique[counts == 1]

# find the indices where the unique sheet names exist
idx_to_select = np.argwhere(np.isin(names_to_find, unique_set)).flatten()

# use a list comprehension to subset data_sheets
data_sheets = [data_sheets[i] for i in idx_to_select]

# concat the data
data = pd.concat(data_sheets)
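For comparison, here is a sketch of the same idea without NumPy. This is an alternative, not the method used in the answer: it counts sheet names with collections.Counter and filters in a single comprehension. The function name concat_unique_sheets and the sample frames are hypothetical, and in-memory dataframes stand in for real Excel files so the example runs on its own.

```python
from collections import Counter

import pandas as pd


def concat_unique_sheets(workbooks):
    """Concatenate sheets whose name occurs exactly once across all workbooks.

    `workbooks` is a list of dicts mapping sheet name -> DataFrame, the
    shape returned by pd.read_excel(path, sheet_name=None).
    """
    # Count how many times each sheet name occurs across all workbooks
    counts = Counter(name for sheets in workbooks for name in sheets)
    # Keep only the frames whose sheet name occurs once
    frames = [
        frame
        for sheets in workbooks
        for name, frame in sheets.items()
        if counts[name] == 1
    ]
    return pd.concat(frames, ignore_index=True)


# Stand-in data instead of real Excel files (hypothetical values)
book1 = {'A': pd.DataFrame({'x': [0]}), '1': pd.DataFrame({'v': [1, 2]})}
book2 = {'A': pd.DataFrame({'x': [0]}), '9': pd.DataFrame({'v': [3]})}
master = concat_unique_sheets([book1, book2])
print(master)
```

With real files, the input could be built as [pd.read_excel(f, sheet_name=None) for f in excelfile_list], and the result could be saved as the "master" spreadsheet with master.to_excel('master.xlsx', index=False), where the output filename is only a placeholder.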

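【Comments】:

Should I put the names of the data sheets in the empty list data_sheets? It gives the error "No objects to concatenate". Where should I enter the names of the Excel sheets?
Where should I add the names of a) the 5 sheets and b) the 10 workbooks? Also, do I need to define excelfile_list?
excelfile_list should be a list of all the Excel files you want to process. glob.glob is your friend.
Where should I add a) the sheets to be appended and b) the sheets that should not be appended?
This routine decides for you which sheets to append. You only need to define excelfile_list at the top of the code. Something like excelfile_list = glob.glob('some_directory//*.xlsx'). Then run the code. If you want to see which sheets were appended, look at unique_set. It should have a length of 50.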
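As a rough sketch of how glob.glob builds such a list, the example below creates a throwaway folder with empty placeholder files and collects only the .xlsx paths. The folder and file names are made up for illustration; with real data, point the pattern at your own directory instead.

```python
import glob
import os
import tempfile

# Throwaway directory with empty placeholder files standing in for workbooks
with tempfile.TemporaryDirectory() as folder:
    for name in ('first.xlsx', 'second.xlsx', 'notes.txt'):
        open(os.path.join(folder, name), 'w').close()

    # Collect every .xlsx path in the folder; sorted for a stable order
    excelfile_list = sorted(glob.glob(os.path.join(folder, '*.xlsx')))
    found = [os.path.basename(p) for p in excelfile_list]

print(found)
```

The .txt file is ignored because it does not match the pattern. If excelfile_list comes back empty (for example, because of a wrong path), nothing is read and pd.concat has nothing to combine, which matches the "No objects to concatenate" error mentioned above.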