How to use Pandas to read only the Excel header? Answers

【Question title】: How to use Pandas to only read excel header?
【Posted】: 2020-07-07 19:42:50
【Question description】:

I know how to read an Excel sheet with pandas:

import pandas as pd

table = pd.read_excel(io)

After loading the data, the header is available as:

table.columns

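This works, but sometimes I only want the header of the Excel sheet, especially when the sheet body is large. Loading the whole table into memory is time-consuming and unnecessary, and sometimes it exhausts memory and hangs. According to the official documentation, the nrows parameter limits how many rows are read from the Excel file, so I should be able to use it to read only the header row: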

header = pd.read_excel(io, nrows = 0)

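However, I found that this does not stop pandas from reading the entire Excel file, and it still takes a lot of time and memory. Does anyone have good experience dealing with this problem?

【Comments】:

Does this answer your question? Reading column names alone in a csv file
No, xlsx files are different.
So only the file extension changes; try that code after changing the file extension.
Take a look at this library and see whether it helps: pyexcel

Tags: python excel pandas dataframe openpyxl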

【Solution 1】:

import pandas as pd 

Frame=pd.read_excel("/content/data.xlsx" , header=0)
Frame.head()

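【Discussion】:

Thanks, but I only want the header. This approach still reads the whole data set, which takes a lot of time and memory.

【Solution 2】:

A simple code snippet found on the internet: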

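import openpyxl
import pandas as pd

def read_excel(filename, nrows):
    # read_only=True streams rows lazily; data_only=True returns cached values instead of formulas
    book = openpyxl.load_workbook(filename=filename, read_only=True, data_only=True)
    first_sheet = book.worksheets[0]
    rows_generator = first_sheet.values

    # The first row is the header; nrows counts the header plus the data rows,
    # so nrows=1 returns an empty DataFrame that carries only the column names
    header_row = next(rows_generator)
    data_rows = [row for (_, row) in zip(range(nrows - 1), rows_generator)]
    book.close()  # read-only workbooks keep the file open until closed
    return pd.DataFrame(data_rows, columns=header_row)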
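
If only the column names are needed, the same idea can be reduced to reading a single row. The following is a minimal sketch, assuming the header sits in the first row of the first worksheet. `read_header` is a hypothetical helper name chosen for illustration; it is not part of pandas or openpyxl.

```python
import openpyxl


def read_header(filename):
    """Return the first row of the first worksheet as a list of cell values."""
    # read_only=True streams rows lazily instead of building the whole sheet in memory
    book = openpyxl.load_workbook(filename=filename, read_only=True, data_only=True)
    try:
        first_sheet = book.worksheets[0]
        # max_row=1 asks for the header row only
        rows = first_sheet.iter_rows(min_row=1, max_row=1, values_only=True)
        # Fall back to an empty tuple if the sheet yields no rows at all
        header_row = next(rows, ())
        return list(header_row)
    finally:
        book.close()  # read-only workbooks keep the file open until closed
```

Calling `read_header("/content/data.xlsx")` would return the column names as a plain list, which can then be passed to `pd.DataFrame(columns=...)` if an empty frame is wanted.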

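【Discussion】:

Please cite the source of any material you copied.

【Solution 3】:

This function, sheet_rows, uses openpyxl directly instead of pandas; it is much faster than read_excel( nrows=0 ), and simple: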

#!/usr/bin/env python3

import sys

import openpyxl  # https://openpyxl.readthedocs.io
import pandas as pd

#...............................................................................
def sheet_rows( sheet, nrows=3, ncols=None, verbose=5 ) -> "list of lists":
    """ openpyxl sheet -> the first `nrows` rows x `ncols` columns
        verbose=5: print A1 .. A5, E1 .. E5 as lists
    """
    rows = sheet.iter_rows( max_row=nrows, max_col=ncols, values_only=True )
    rows = [list(r) for r in rows]  # generator -> list of lists
    if verbose:
        # %s rather than %d: in read-only mode max_row / max_column can be None
        print( "\n-- %s  %s rows  %s cols" % (
                sheet.title, sheet.max_row, sheet.max_column ))
        for row in rows[:verbose]:
            # drop empty cells only; filter( None, ... ) would also drop 0 and ""
            trimNone = [v for v in row[:verbose] if v is not None]
            print( trimNone )
    return rows


xlsxin = sys.argv[1]  # path to the .xlsx file, given on the command line
nrows = 3  # number of leading rows to read from each sheet
wb = openpyxl.load_workbook( xlsxin, read_only=True )
print( "\n-- openpyxl.load_workbook( \"%s\" )" % xlsxin )

for sheetname in wb.sheetnames:
    sheet = wb[sheetname]

    rows = sheet_rows( sheet, nrows=nrows )

    # drop rows and columns that are entirely empty
    df = (pd.DataFrame( rows )  # index= columns=
            .dropna( axis="index", how="all" )
            .dropna( axis="columns", how="all" )
            )
    print( df )
    # df.to_excel df.to_csv ...

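The "Partial reading" section of the pyexcel documentation explains that most Excel readers load all the data into memory before doing anything else, which is slow. openpyxl's iter_rows() fetches a few rows or columns quickly; I do not know about its memory use.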
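
To hand the result back to pandas in the shape the question asked for (what `pd.read_excel(io, nrows=0)` returns), the first row of each sheet can seed an empty DataFrame. This is a rough sketch under two assumptions: every sheet in the workbook is a regular worksheet, and its header is in row 1. `header_frames` is a hypothetical name used only for illustration.

```python
import openpyxl
import pandas as pd


def header_frames(path):
    """Map each sheet name to an empty DataFrame whose columns are that sheet's first row."""
    wb = openpyxl.load_workbook(path, read_only=True)
    try:
        frames = {}
        for sheetname in wb.sheetnames:
            sheet = wb[sheetname]
            # Only the first row is requested, so the data rows are never materialized
            first_row = next(sheet.iter_rows(max_row=1, values_only=True), ())
            frames[sheetname] = pd.DataFrame(columns=list(first_row))
        return frames
    finally:
        wb.close()  # release the file handle held by read-only mode
```

Each value in the returned dict has zero rows, so `frames[name].columns` gives the header without a data-bearing DataFrame ever being built.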

【Discussion】: