Adding a column via loop to dataframe: answers

【Question title】: Adding a column via loop to dataframe
【Posted】: 2022-01-20 10:56:09
【Question description】:

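I am trying to create a loop that sets up a dataframe with an index (df1), walks through a selected folder, finds a txt file, extracts its second column (called counts), and adds it to df1. It then continues through the folder and does the same for the next file, adding that to df1 as well. The result should be a processed txt file containing the index, then the first file's counts in one column, the second file's counts in the next column, and so on.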

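I am really having trouble with my loop and cannot get it to stop overwriting the first txt file's counts. On top of that, it keeps treating the new column header as a data cell, which throws everything out of alignment. As it stands, it just overwrites the column and leaves a random integer in the first row of what should be the next column.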

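Any help would be greatly appreciated. Apologies for the number of print lines, I just wanted to be sure I understood what each step was doing.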

    def changeFolder(self):
    folder = QFileDialog.getExistingDirectory(None, 'Project Data', '.csv files')
    print(folder)
    if folder == None:
        return
    else:
        print(folder)

    from glob import glob
    import pandas as pd
    import numpy as np
    import os
    # create lag variable for the time lag array from -50 to 50
    lag = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
           21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46,
           47, 48, 49, 50, 51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99]
    #generate data frame with the lag time in one column
    df1 = pd.DataFrame(index=lag)
    #print
    print('df1', df1)
    #for every file in the directory folder specified
    for file in os.listdir(folder):
        print('folder', folder)
        if file.endswith(".txt"):
            print('file', file)
            selfolder = folder
            newpath = os.path.abspath(os.path.join(selfolder, file))
            print('newpath', newpath)
            #read the file in the loop
            df2 = pd.read_csv(newpath, delimiter=" ", dtype="Int64", header=None)
            df2.to_string(index=False)
            #df2.columns = ['Lag', 'Counts']
            #take the second column of said folder and save it to the original dataframe
            print('df2', df2)
            #counts = df2.iloc[:,1]
            print('now for the counts')
            print(df2.iloc[:,1])
            df2['count'] = df2.iloc[:,1]
            df1['df1count'] = df2['count']
            df1.df1count = df1.df1count.astype(float)
            print(df1.df1count)
            count_df = pd.DataFrame(data={len(df2['count'].groupby(df2['count']))}, columns=['test'])
            new_df = pd.concat([df1, count_df], axis=1)
            print(new_df)
        continue
    savepath = newpath[:-4]
    #save and convert to .txt
    new_df.to_csv(savepath + ' processed.txt')

    ##Dialogue box in case of success
    mbox = QMessageBox()
    mbox.setText("Hopefully this worked!")
    mbox.setDetailedText("")
    mbox.setStandardButtons(QMessageBox.Ok)
    mbox.setWindowTitle('CSV Batch Processor')
    mbox.exec_()

【Discussion】:

Tags: python pandas dataframe loops

【Solution 1】:

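Could you explain which dataframe contains what? That would help in writing code that fits your example. Assuming the one you care about is new_df: pd.concat returns the concatenation of the list you give it, i.e. [df1, count_df], so the reference to the previous concatenation result is lost on every iteration. If you instantiate new_df beforehand and include it in the list to concatenate, it should be fine: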

new_df = pd.DataFrame()
for file in os.listdir(folder):
    print('folder', folder)
    if file.endswith(".txt"):
        print('file', file)
        selfolder = folder
        newpath = os.path.abspath(os.path.join(selfolder, file))
        print('newpath', newpath)
        #read the file in the loop
        df2 = pd.read_csv(newpath, delimiter=" ", dtype="Int64", header=None)
        df2.to_string(index=False)
        #df2.columns = ['Lag', 'Counts']
        #take the second column of said folder and save it to the original dataframe
        print('df2', df2)
        #counts = df2.iloc[:,1]
        print('now for the counts')
        print(df2.iloc[:,1])
        df2['count'] = df2.iloc[:,1]
        df1['df1count'] = df2['count']
        df1.df1count = df1.df1count.astype(float)
        print(df1.df1count)
        count_df = pd.DataFrame(data={len(df2['count'].groupby(df2['count']))}, columns=['test'])
        # Change the previous assignment to:
        new_df = pd.concat([new_df, df1, count_df], axis=1)
        print(new_df)
    continue
# etc.

This way the previous result is included in every concatenation, and you should get the result you want.
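A possible alternative, offered only as a sketch: the overwriting in the question appears to come from assigning every file to the same column label ('df1count'), and the stray integer probably comes from count_df, which holds a single value. One way to avoid both is to store each file's counts under its own column name and concatenate once after the loop. The function name collect_counts, the n_lags parameter, the demo file names, and the assumption that every file has exactly two space-separated integer columns (lag, counts) with no header are all hypothetical and not taken from the original post.

```python
import os
import tempfile

import pandas as pd


def collect_counts(folder, n_lags=100):
    """Build one DataFrame with a counts column per .txt file in `folder`."""
    columns = {}
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".txt"):
            continue
        path = os.path.join(folder, name)
        # Assumption: two space-separated integer columns (lag, counts), no header.
        data = pd.read_csv(path, sep=" ", header=None, names=["lag", "counts"])
        # Use the file name (without extension) as a unique column label,
        # and index the counts by lag so all columns align on the same index.
        columns[os.path.splitext(name)[0]] = data.set_index("lag")["counts"]
    # Concatenate once after the loop; the dict keys become the column names.
    result = pd.concat(columns, axis=1) if columns else pd.DataFrame()
    # Align to the shared lag index 0 .. n_lags - 1 (missing lags become NaN).
    return result.reindex(range(n_lags))


# Small demo with two made-up files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    for name, counts in [("a.txt", [5, 7, 9]), ("b.txt", [1, 2, 3])]:
        with open(os.path.join(tmp, name), "w") as fh:
            fh.writelines(f"{lag} {c}\n" for lag, c in enumerate(counts))
    demo = collect_counts(tmp, n_lags=3)
    print(demo)
```

Because each column is keyed by its file name, nothing is overwritten between iterations, and the combined frame can then be written out with to_csv as in the question.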

【Discussion】: