Efficiently concatenating/appending dataframes in a for loop with Python pandas to get a single large dataframe

【Question title】: Efficiently concatenating/appending dataframes in a for loop with Python pandas to get a single large dataframe
【Posted】: 2022-01-23 16:41:12
【Question description】:

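The logic: I am reading multiple PDF files that contain some highlighted portions (assume these are tables).

After pushing the highlights into a list, I save them to a dataframe. Here is the code: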

    try:
        filepath = [file for file in glob.glob("Folder/*.pdf")]
        for file in filepath:
            doc = fitz.open(file)
            print(file)

            highlights = []
            for page in doc:
                highlights += handle_page(page)

            #print(highlights)
            highlights_alt = highlights[0].split(',')
            df = pd.DataFrame(highlights_alt, columns=['Security Name'])
            #print(df.columns.tolist())
            df[['Security Name', 'Weights']] = df['Security Name'].str.rsplit(n=1, expand=True)
            df.drop_duplicates(keep='first', inplace=True)
            print(df.head())
            print(df.shape)
    except IndexError:
        print('file {} is not highlighted'.format(file))

With this logic I do get dataframes, but if the folder holds 5 PDFs, it creates 5 separate dataframes, like this:

    Folder\z.pdf
    Security Name Weights
    0     UTILITIES   (5.96
    1           %*)    None
    (2, 2)

    Folder\y.pdf
     Security Name Weights
    0  Quantity/ Market Value % of Net Investments Cu...   1.125
    1                                                  %      01
    2                                                /07    None
    3                                              /2027    None
    4                                                EUR     230
    (192, 2)

    Folder\x.pdf
    Security Name Weights
    0                  Holding    £740
    1                      000    None
    2   Leeds Building Society    3.75
    3               % variable      25
    4                       /4    None
    (526, 2)

What I want instead is a single dataframe containing all of the records above, with a shape of something like (720, 2):

    Security Name Weights
    0                  Holding    £740
    1                      000    None
    2   Leeds Building Society    3.75
    3               % variable      25
    4                       /4    None
    .
    .
    720  xyz                      3.33
    (720, 2)

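I have tried pandas' concat and append, but so far without success. Please suggest an efficient approach, because there will be more than 1000 PDFs in the future.

Please help!

【Question comments】:

Tags: python pandas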

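【Solution 1】:

A quick way is to use pd.concat, where list_of_dfs is a list holding the dataframes you want to stack:

    big_df = pd.concat(list_of_dfs, axis=0)

If this raises an error, it would help to know what the error is.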
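
To make the suggestion concrete, here is a minimal sketch of the collect-then-concat pattern. It is not the asker's actual pipeline: the PDF reading is replaced by a hypothetical dictionary of already-extracted highlight strings (highlights_per_file), and build_frame is an illustrative helper name. Only the pandas calls are real.

```python
import pandas as pd


def build_frame(highlight_text: str) -> pd.DataFrame:
    """Turn one comma-separated highlight string into a two-column frame."""
    df = pd.DataFrame(highlight_text.split(','), columns=['Security Name'])
    # Split off the last whitespace-separated token as the weight.
    df[['Security Name', 'Weights']] = df['Security Name'].str.rsplit(n=1, expand=True)
    return df.drop_duplicates(keep='first')


# Hypothetical stand-ins for the text extracted from each PDF.
highlights_per_file = {
    "Folder/x.pdf": "Alpha Corp 1.5,Beta Ltd 2.5",
    "Folder/y.pdf": "Gamma Inc 3.0",
}

# Collect one frame per file in a plain Python list...
list_of_dfs = []
for path, text in highlights_per_file.items():
    list_of_dfs.append(build_frame(text))

# ...and concatenate once, after the loop. ignore_index=True renumbers
# the rows 0..n-1 instead of repeating each file's own 0-based index.
big_df = pd.concat(list_of_dfs, axis=0, ignore_index=True)
print(big_df.shape)
```

Appending to a list and calling pd.concat once avoids copying the growing dataframe on every iteration, which should matter once the folder holds 1000+ files.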

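【Comments】:

I get this error: TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"
Make sure list_of_dfs really is a list; the error says it is a dataframe.
I used final_df = pd.concat([df], axis=0) and it still gives the same output as in my question.
That is strange, but it is hard to debug remotely.
You can take a look at the code I wrote. Would creating the dataframe before the for loop help? I am already creating it inside the loop itself.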
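
The exchange above can probably be explained with a short sketch. Passing a bare dataframe to pd.concat raises the TypeError, while pd.concat([df]) is valid but only ever sees the current file's frame, so nothing changes. One possible fix, assuming the per-file highlights look like the hypothetical extracted dictionary below (an empty list stands in for a PDF without highlights), is to create the list before the loop and to handle IndexError per file so a single unhighlighted PDF does not abort the whole run:

```python
import pandas as pd

df = pd.DataFrame({'Security Name': ['Alpha Corp'], 'Weights': ['1.5']})

# A bare DataFrame is rejected: concat expects an iterable of frames.
try:
    pd.concat(df, axis=0)
    raised_type_error = False
except TypeError:
    raised_type_error = True

# Wrapping a single frame in a list is accepted but returns the same data.
same = pd.concat([df], axis=0)

# Hypothetical stand-in for what handle_page() collects per file.
extracted = {
    "Folder/x.pdf": ["Alpha Corp 1.5,Beta Ltd 2.5"],
    "Folder/y.pdf": [],  # no highlights in this file
    "Folder/z.pdf": ["Gamma Inc 3.0,Delta Plc 4.0"],
}

frames = []   # created once, before the loop
skipped = []  # files that had nothing highlighted
for path, highlights in extracted.items():
    try:
        parts = highlights[0].split(',')
    except IndexError:
        # Skip only this file and keep processing the rest.
        skipped.append(path)
        continue
    one = pd.DataFrame(parts, columns=['Security Name'])
    one[['Security Name', 'Weights']] = one['Security Name'].str.rsplit(n=1, expand=True)
    frames.append(one.drop_duplicates(keep='first'))

# One concat after the loop yields the single combined dataframe.
final_df = pd.concat(frames, axis=0, ignore_index=True)
print(final_df.shape, skipped)
```

In the code from the question, the try/except wraps the whole loop, so the first file without highlights would end processing for every file after it; moving it inside the loop is a design choice of this sketch, not something stated in the thread.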