【Question Title】: Adding a column via loop to dataframe
【Posted】: 2022-01-20 10:56:09
【Question】:

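I am trying to create a loop that sets up a dataframe with an index (df1), goes through a selected folder, finds a txt file, extracts its second column (called counts), and adds it to df1. It then continues through the folder and does the same for the next file, adding that to df1 as well. The result should be a processed txt file containing the index, a counts column for the first file, a next column with the counts from the second file, and so on.

I am really struggling with the loop and can't get it to stop overwriting the counts from the first txt file. On top of that, it keeps treating the new column header as a data cell, which throws everything out of alignment. As it stands, it just overwrites the column and leaves a random integer in the first row of what should be the next column.

Any help would be much appreciated. Apologies for the number of print lines; I just wanted to be sure I understood what each step was doing.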

    def changeFolder(self):
    folder = QFileDialog.getExistingDirectory(None, 'Project Data', '.csv files')
    print(folder)
    if folder == None:
        return
    else:
        print(folder)

    from glob import glob
    import pandas as pd
    import numpy as np
    import os
    # create lag variable for the time lag array from -50 to 50
    lag = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
           21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46,
           47, 48, 49, 50, 51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99]
    #generate data frame with the lag time in one column
    df1 = pd.DataFrame(index=lag)
    #print
    print('df1', df1)
    #for every file in the directory folder specified
    for file in os.listdir(folder):
        print('folder', folder)
        if file.endswith(".txt"):
            print('file', file)
            selfolder = folder
            newpath = os.path.abspath(os.path.join(selfolder, file))
            print('newpath', newpath)
            #read the file in the loop
            df2 = pd.read_csv(newpath, delimiter=" ", dtype="Int64", header=None)
            df2.to_string(index=False)
            #df2.columns = ['Lag', 'Counts']
            #take the second column of said folder and save it to the original dataframe
            print('df2', df2)
            #counts = df2.iloc[:,1]
            print('now for the counts')
            print(df2.iloc[:,1])
            df2['count'] = df2.iloc[:,1]
            df1['df1count'] = df2['count']
            df1.df1count = df1.df1count.astype(float)
            print(df1.df1count)
            count_df = pd.DataFrame(data={len(df2['count'].groupby(df2['count']))}, columns=['test'])
            new_df = pd.concat([df1, count_df], axis=1)
            print(new_df)
        continue
    savepath = newpath[:-4]
    #save and convert to .txt
    new_df.to_csv(savepath + ' processed.txt')

    ##Dialogue box in case of success
    mbox = QMessageBox()
    mbox.setText("Hopefully this worked!")
    mbox.setDetailedText("")
    mbox.setStandardButtons(QMessageBox.Ok)
    mbox.setWindowTitle('CSV Batch Processor')
    mbox.exec_()

【Comments】:

    Tags: python pandas dataframe loops


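    【Solution 1】:

    Could you explain which dataframe contains what? That would help in writing code that fits your example. Assuming the one you care about is new_df: pd.concat returns the concatenation of the list you pass it, i.e. [df1, count_df], so on every iteration the reference to the previous concatenation result is lost. If you instantiate new_df before the loop and include it in the list being concatenated, it should be fine: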

    new_df = pd.DataFrame()
    for file in os.listdir(folder):
        print('folder', folder)
        if file.endswith(".txt"):
            print('file', file)
            selfolder = folder
            newpath = os.path.abspath(os.path.join(selfolder, file))
            print('newpath', newpath)
            #read the file in the loop
            df2 = pd.read_csv(newpath, delimiter=" ", dtype="Int64", header=None)
            df2.to_string(index=False)
            #df2.columns = ['Lag', 'Counts']
            #take the second column of said folder and save it to the original dataframe
            print('df2', df2)
            #counts = df2.iloc[:,1]
            print('now for the counts')
            print(df2.iloc[:,1])
            df2['count'] = df2.iloc[:,1]
            df1['df1count'] = df2['count']
            df1.df1count = df1.df1count.astype(float)
            print(df1.df1count)
            count_df = pd.DataFrame(data={len(df2['count'].groupby(df2['count']))}, columns=['test'])
            # Change the previous assignment to:
            new_df = pd.concat([new_df, df1, count_df], axis=1)
            print(new_df)
        continue
    # etc.
    

    This way, each iteration's result is concatenated onto the previous ones, and you should get the result you are after.
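    As a variation on the same idea, here is a minimal sketch that gives every file its own named column instead of reusing 'df1count'. It assumes each .txt file holds two space-separated integer columns (lag, counts) with no header, as the read_csv call in the question suggests. The helper name collect_counts and the column names 'lag' and 'counts' are made up for illustration, and count_df is left out because the stated goal is only one counts column per file.

```python
import os
import tempfile

import pandas as pd


def collect_counts(folder):
    """Return one DataFrame with a counts column per .txt file in `folder`."""
    columns = {}
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".txt"):
            continue
        path = os.path.join(folder, name)
        # Assumed layout: two space-separated columns (lag, counts), no header row.
        df2 = pd.read_csv(path, sep=" ", header=None, names=["lag", "counts"])
        # Index by lag so that columns from different files align on the lag value.
        columns[os.path.splitext(name)[0]] = df2.set_index("lag")["counts"]
    if not columns:
        # pd.concat raises on an empty collection, so return an empty frame instead.
        return pd.DataFrame()
    # One concat after the loop; the dict keys become the column names.
    return pd.concat(columns, axis=1)


if __name__ == "__main__":
    # Small self-contained demo using two throwaway files.
    with tempfile.TemporaryDirectory() as folder:
        for name, counts in [("a.txt", [5, 6, 7]), ("b.txt", [8, 9, 10])]:
            with open(os.path.join(folder, name), "w") as fh:
                for lag, c in enumerate(counts):
                    fh.write(f"{lag} {c}\n")
        print(collect_counts(folder))
```

    Naming each column after its file avoids overwriting a shared column on every pass, and concatenating once after the loop avoids growing new_df step by step. The result can then be written out with to_csv as in the question.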
