【Question title】: Split dataframe by empty row and reshape by headers
【Posted】: 2017-09-19 16:41:29
【Question】:

I have a CSV sheet that contains several tables per sheet, like this:

Name     Header-1     Header-8     Header 3
Random Note
Jack     X                         X
Jane                    X
NAN      NAN          NAN          NAN
Name     Header 3     Header 2     Header 7
Random note
Jeremy   X            X
Joey                               X

Can I split the tables at the blank rows and then reshape them into a single dataframe, with a result like this:

Name     Header-1     Header-2     Header-3     .....
Jack     X
Jane                    X
Jeremy                              X
Joey         X          X            X

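I would like to use the blank rows as a new index and read each table as a new df. The headers of every table are the same, they are just not in the same order. Ultimately, I want to stitch the tables back together into one clean DF.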

【Comments】:

    Tags: python pandas dataframe reshape munge


    【Solution 1】:

    Assuming your csv is laid out as follows:

    Name,Header-1,Header-2,Header-3
    Random,Note, , 
    Jack,X,X,   
    Jane,X, , 
    ,,,
    Name,Header-3,Header-2,Header-1
    Random,note, , 
    Jeremy,X,X, 
    Joey, , ,X
    

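    You can process this file with the following code. It marks the all-NaN separator rows, numbers the runs of rows between them, and promotes the first row of every later run to a header: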

    import pandas as pd
    # Read csv file; the separator line ",,," is read as an all-NaN row
    df = pd.read_csv("D:/tmp/data.csv", sep=',')
    
    # Find rows where "Name" is null, number the runs between them and group by that number
    isnull = df["Name"].isnull()
    partitions = (isnull != isnull.shift()).cumsum()
    gb = df[~isnull].groupby(partitions)
    
    # Extract one dataframe per table (as copies, so they can be modified safely)
    dfs = [group.copy() for _, group in gb]
    
    datas = []
    # Set the header as first row for all dataframes that are not the first one
    for i, data in enumerate(dfs):
        if i != 0:    # First dataframe has already set the correct header
            data.columns = data.iloc[0].tolist()  # first row holds this table's header
            data = data.drop(data.index[0])
        datas.append(data)
    
    # Concatenate the dataframes (columns are matched by name) and reset the index
    df_concat = pd.concat(datas)
    df_out = df_concat.reset_index(drop=True)
    
    # Put "Name" first, whatever column order concat produced
    cols = ["Name"] + [c for c in df_out.columns if c != "Name"]
    df_out = df_out[cols]
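
    The partition step is the least obvious part of the code, so here is a small sketch that prints the intermediate values. It assumes the sample data from this answer (blank cells written as empty fields) and uses `io.StringIO` in place of the file on disk; the `CSV_TEXT` name is only for illustration.

```python
import io
import pandas as pd

# Sample data adapted from the answer; io.StringIO stands in for the csv file.
CSV_TEXT = """Name,Header-1,Header-2,Header-3
Random,Note,,
Jack,X,X,
Jane,X,,
,,,
Name,Header-3,Header-2,Header-1
Random,note,,
Jeremy,X,X,
Joey,,,X
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# True only on the separator row, where "Name" was read as NaN
isnull = df["Name"].isnull()

# A new run starts wherever the True/False value differs from the row above,
# so a cumulative sum over those change points numbers the runs.
partitions = (isnull != isnull.shift()).cumsum()

print(pd.DataFrame({"Name": df["Name"], "isnull": isnull, "partition": partitions}))
```

    The first table gets label 1, the separator row gets label 2 and the second table gets label 3. Because the separator rows are filtered out with `df[~isnull]` before grouping, only the groups that hold real tables remain.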
    

    So your input is:

    >>> df
         Name  Header-1  Header-2  Header-3
    0  Random      Note                    
    1    Jack         X         X           
    2    Jane         X                    
    3     NaN       NaN       NaN       NaN
    4    Name  Header-3  Header-2  Header-1
    5  Random      note                    
    6  Jeremy         X         X          
    7    Joey                             X
    

    Note that in this example the headers of the second dataframe to be extracted are in a different order.
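
    The differing order is not a problem because `pd.concat` matches columns by label rather than by position. A minimal sketch with one row from each table of the sample (the variable names are only for illustration):

```python
import pandas as pd

# One row from each table of the sample, with the headers in different orders
first = pd.DataFrame([["Jack", "X", "X", ""]],
                     columns=["Name", "Header-1", "Header-2", "Header-3"])
second = pd.DataFrame([["Joey", "", "", "X"]],
                      columns=["Name", "Header-3", "Header-2", "Header-1"])

# concat aligns on column labels, so each value lands under its own header
combined = pd.concat([first, second], ignore_index=True)
print(combined)
```

    Joey's `X` sits in the last position of the second table, yet it ends up under `Header-1` in the combined frame.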

    Your output will be:

    >>> df_out
         Name Header-1 Header-2 Header-3
    0  Random     Note                  
    1    Jack        X        X         
    2    Jane        X                  
    3  Random                       note
    4  Jeremy                 X        X
    5    Joey        X                  
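
    The question also asks about reading each table as its own df. An alternative sketch, which is not the method used in the answer above, splits the raw text at the separator lines first and parses every chunk with its own header. `read_stacked_tables` and its `note_rows` parameter are hypothetical names; dropping the note row under each header is an assumption based on the desired output in the question, and a separator is assumed to be a line that is empty once commas and whitespace are removed.

```python
import io
import pandas as pd


def read_stacked_tables(text, note_rows=1):
    """Split CSV text at separator lines and read each chunk as its own table.

    note_rows (hypothetical parameter) is the number of free-text rows
    directly under each header that are dropped from the result.
    """
    chunks, current = [], []
    for line in text.splitlines():
        if line.replace(",", "").strip() == "":  # separator line
            if current:
                chunks.append(current)
            current = []
        else:
            current.append(line)
    if current:
        chunks.append(current)

    tables = []
    for chunk in chunks:
        # dtype=str keeps every cell as text, including all-empty columns
        table = pd.read_csv(io.StringIO("\n".join(chunk)), dtype=str)
        tables.append(table.iloc[note_rows:])
    return tables


# Sample data adapted from the answer (blank cells written as empty fields)
CSV_TEXT = """Name,Header-1,Header-2,Header-3
Random,Note,,
Jack,X,X,
Jane,X,,
,,,
Name,Header-3,Header-2,Header-1
Random,note,,
Jeremy,X,X,
Joey,,,X
"""

tables = read_stacked_tables(CSV_TEXT)
result = pd.concat(tables, ignore_index=True)
result = result[["Name", "Header-1", "Header-2", "Header-3"]].fillna("")
print(result)
```

    Each element of `tables` is a separate dataframe with its own header order, and the final `concat` lines them up by column name as before.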
    

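    【Comments】:

    • I get a KeyError on "Name" at line 6 - I have swapped it for another column name...
    • Did you try the example I showed? Please update your question with an example that produces the error.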
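
    A KeyError on `df["Name"]` means the parsed frame has no column with exactly that label. The thread never establishes the cause, so the following is only a sketch of one way to check, assuming the header cells carry stray whitespace; the file content in `raw` is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical file content whose first header cell is padded with spaces
raw = " Name ,Header-1\nJack,X\n"
df = pd.read_csv(io.StringIO(raw))

# Inspect the exact labels pandas parsed; repr-style output shows the padding
print(df.columns.tolist())

# Normalise the labels, after which the lookup by "Name" succeeds
df.columns = df.columns.str.strip()
print(df["Name"].tolist())
```

    If the printed labels already look correct, the grouping column in the real file is probably named differently and the code has to use that name consistently.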