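The order of operations here is incorrect. The line:
df['Month'] = str(files[file])
overwrites the entire column with the most recent value on every iteration.
Instead, we should add the value only to the current file's DataFrame: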
import os
import pandas as pd
paths = "C://Users//6J2754897//Downloads//monthlydata"
files = os.listdir(paths)
df = pd.DataFrame()
for file in range(len(files)):
    if files[file].endswith('.xlsx'):
        # Read in File
        file_df = pd.read_excel(paths + "//" + files[file],
                                sheet_name="information",
                                skiprows=7)
        # Add to just this DataFrame
        file_df['Month'] = str(files[file])
        # Update `df` (DataFrame.append was removed in pandas 2.0)
        df = pd.concat([df, file_df], ignore_index=True)
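To see why the ordering matters, here is a minimal sketch that uses small in-memory DataFrames in place of the Excel files. The file names and values are made up for illustration, and rows are accumulated with pd.concat so the sketch also runs on recent pandas versions:

```python
import pandas as pd

# Hypothetical stand-ins for the contents of two monthly workbooks
frames = {
    "jan.xlsx": pd.DataFrame({"value": [1, 2]}),
    "feb.xlsx": pd.DataFrame({"value": [3]}),
}

# Problematic order: label the accumulated frame after adding each file
buggy = pd.DataFrame()
for name, file_df in frames.items():
    buggy = pd.concat([buggy, file_df], ignore_index=True)
    buggy["Month"] = name  # overwrites every row, including earlier files

# Corrected order: label only the current file's frame, then accumulate
fixed = pd.DataFrame()
for name, file_df in frames.items():
    labeled = file_df.assign(Month=name)
    fixed = pd.concat([fixed, labeled], ignore_index=True)

print(buggy["Month"].tolist())
print(fixed["Month"].tolist())
```

In the first loop every row ends up carrying the name of the last file, while the second loop keeps each row tied to the file it came from.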
Alternatively, we can use DataFrame.assign to chain the column assignment:
import os
import pandas as pd
paths = "C://Users//6J2754897//Downloads//monthlydata"
files = os.listdir(paths)
df = pd.DataFrame()
for file in range(len(files)):
    if files[file].endswith('.xlsx'):
        df = pd.concat(
            [
                df,
                # Read in File
                pd.read_excel(paths + "//" + files[file],
                              sheet_name="information",
                              skiprows=7)
                .assign(Month=str(files[file])),  # Add to just this DataFrame
            ],
            ignore_index=True
        )
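As a general overall improvement, we can use pd.concat with a list comprehension over the files. This avoids repeatedly growing a DataFrame inside the loop (which can be very slow). Path.glob from pathlib can also help select only the appropriate files: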
from pathlib import Path
import pandas as pd
paths = "C://Users//6J2754897//Downloads//monthlydata"
df = pd.concat([
    pd.read_excel(file,
                  sheet_name="information",
                  skiprows=7)
    .assign(Month=file.stem)  # We may also want file.name here
    for file in Path(paths).glob('*.xlsx')
])
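The same pattern can be tried without the original workbooks. The sketch below is only an illustration under stated assumptions: it writes two hypothetical CSV files to a temporary folder and reads them with pd.read_csv, as a stand-in for pd.read_excel, so that no Excel engine is required. The glob results are sorted because the order returned by Path.glob is not guaranteed.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Hypothetical monthly files, written as CSV instead of .xlsx
    pd.DataFrame({"value": [1, 2]}).to_csv(folder / "2021-01.csv", index=False)
    pd.DataFrame({"value": [3]}).to_csv(folder / "2021-02.csv", index=False)
    # A file with another extension, which the glob pattern skips
    (folder / "notes.txt").write_text("not a data file")

    combined = pd.concat(
        [
            pd.read_csv(file).assign(Month=file.stem)
            for file in sorted(folder.glob("*.csv"))  # sort for a stable order
        ],
        ignore_index=True,
    )

print(combined)
```

Each frame is labeled before concatenation, so there is no shared column to overwrite, and pd.concat is called only once.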
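Some options for the Month column are:
- file.stem gives "[t]he final path component, without its suffix".
  - 'folder/folder/sample.xlsx' -> 'sample'
- file.name gives "[a] string representing the final path component, excluding the drive and root".
  - 'folder/folder/sample.xlsx' -> 'sample.xlsx'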
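A quick way to compare the two attributes is to build a Path by hand. The path below is just the sample one from the list, not a real file:

```python
from pathlib import Path

# Illustrative path only; the file does not need to exist
file = Path("folder/folder/sample.xlsx")

stem = file.stem  # final component without its suffix
name = file.name  # final component including its suffix

print(stem)  # sample
print(name)  # sample.xlsx
```

file.stem is usually the more convenient choice when the file name itself is the month label, since the extension carries no information.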