【Question title】: Little bit complicated pandas dataframe, row to columns
【Posted】: 2020-07-25 03:15:14
【Question】:

I have a complicated data frame. It contains many blocks, split by datetime and item. The data comes from Excel:

name    sex age ID  start   end main    data    testtime    item    subitem result  unit    mark    reference   testman comfirmman
LSF female  60  12345678    2018-12-18 08:58    2018-12-29 08:30    knee    11                                  
2018-12-28 13:59    metabolism II               comfirm 12345678                                        
subitem         result  unit    mark    reference                                       
Na          142 mmol/L      135 - 145                                       
K           3.98    mmol/L      3.50 - 5.30                                     
Cl          105 mmol/L      96 - 110                                        
PHOS            1.25    mmol/L      0.97 - 1.62                                     
testman:YYY             comfirmman:AAA                                              
2018-12-28 9:57 routine blood               comfirm 12345678                                        
subitem         result  unit    mark    reference                                       
CRP         14.72   mg/L    ↑   0.00 - 10.00                                        
WBC         6.73    x10^9/L     4.00 - 10.00                                        
NEUT%           0.524           0.460 - 0.750                                       
testman:BBB             comfirmman:EEE                                              

I want to turn the rows of each block into columns that follow the column index. What I want:

name    sex age ID  start   end main    data    testtime    item    subitem result  unit    mark    reference   testman comfirmman
LSF female  60  12345678    2018-12-18 08:58    2018-12-29 08:30    knee    11  2018-12-28 13:59    metabolism II   Na  142 mmol/L      135 - 145   YYY AAA
LSF female  60  12345678    2018-12-18 08:58    2018-12-29 08:30    knee    11  2018-12-28 13:59    metabolism II   K   3.98    mmol/L     3.50 - 5.30 YYY AAA
LSF female  60  12345678    2018-12-18 08:58    2018-12-29 08:30    knee    11  2018-12-28 13:59    metabolism II   Cl  105 mmol/L      96 - 110    YYY AAA
LSF female  60  12345678    2018-12-18 08:58    2018-12-29 08:30    knee    11  2018-12-28 13:59    metabolism II   PHOS    1.25    mmol/L     0.97 - 1.62 YYY AAA
LSF female  60  12345678    2018-12-18 08:58    2018-12-29 08:30    knee    11  2018-12-28 9:57 routine blood   CRP 14.72   mg/L    ↑   0.00 - 10.00    BBB EEE
LSF female  60  12345678    2018-12-18 08:58    2018-12-29 08:30    knee    11  2018-12-28 9:57 routine blood   WBC 6.73    x10^9/L     4.00 - 10.00    BBB EEE
LSF female  60  12345678    2018-12-18 08:58    2018-12-29 08:30    knee    11  2018-12-28 9:57 routine blood   NEUT%   0.524           0.460 - 0.750   BBB EEE

Thanks in advance!

【Comments】:

    Tags: python pandas


    【Solution 1】:

    You can do this with the transpose method

    transposed_dataframe = your_dataframe.T
    

    Example:

    import numpy as np
    import pandas as pd
    
    # Just random value
    a = np.random.random(10)
    b = np.random.random(10)
    c = np.random.random(10)
    
    df = pd.DataFrame({'a':a,'b':b,'c':c})
    
    print('Original Dataframe')
    print(df)
    
    transposed_dataframe = df.T
    print('Transposed Dataframe')
    print(transposed_dataframe)
    

    Output:

    Original Dataframe
              a         b         c
    0  0.254146  0.017214  0.024618
    1  0.958870  0.297118  0.935739
    2  0.492764  0.626654  0.259336
    3  0.979305  0.811364  0.321847
    4  0.723043  0.570478  0.222365
    5  0.717678  0.833348  0.188363
    6  0.695006  0.712678  0.313900
    7  0.071923  0.529029  0.018965
    8  0.868739  0.152821  0.349268
    9  0.766499  0.651031  0.109461
    
    Transposed Dataframe
              0         1         2         3         4         5         6         7         8         9
    a  0.254146  0.958870  0.492764  0.979305  0.723043  0.717678  0.695006  0.071923  0.868739  0.766499
    b  0.017214  0.297118  0.626654  0.811364  0.570478  0.833348  0.712678  0.529029  0.152821  0.651031
    c  0.024618  0.935739  0.259336  0.321847  0.222365  0.188363  0.313900  0.018965  0.349268  0.109461
    

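    【Comments】:

    • NO NO NO! Your answer is for regular data. In my case the data should be reshaped according to the blocks and column headers. Thanks
    【Solution 2】: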

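    Extracting data from semi-structured Excel is always ugly

    1. The first two rows (the column header and the first data row) form the master data set
    2. There are multiple detail records; insert the master's primary key (ID) into each of them
    3. Prepare the detail data further
    4. Join the master to the detail and you have your data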
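Before the full code, the four steps can be sketched on a cut-down example. This is a minimal sketch and not the answer's own code: the frame names `master`, `detail` and `flat` are illustrative, only a few of the question's values are used, and the footer values are copied upward with `where` and `bfill` instead of a loop.

```python
import pandas as pd

# Step 1: the master record (a subset of the question's first data row).
master = pd.DataFrame([{"name": "LSF", "sex": "female", "age": "60", "ID": "12345678"}])

# Step 2: detail rows, each tagged with the master's primary key.
detail = pd.DataFrame({
    "subitem": ["Na", "K", "testman:YYY", "CRP", "WBC", "testman:BBB"],
    "result": ["142", "3.98", "comfirmman:AAA", "14.72", "6.73", "comfirmman:EEE"],
})
detail.insert(0, "ID", master.loc[0, "ID"])

# Step 3: copy each block's footer values up onto the rows above it,
# then drop the footer rows themselves.
is_footer = detail["subitem"].str.startswith("testman:")
detail["testman"] = (
    detail["subitem"].str.replace("testman:", "", regex=False).where(is_footer).bfill()
)
detail["comfirmman"] = (
    detail["result"].str.replace("comfirmman:", "", regex=False).where(is_footer).bfill()
)
detail = detail[~is_footer]

# Step 4: join master to detail on the shared key.
flat = master.merge(detail, on="ID")
print(flat)
```

Every remaining detail row now carries the master columns plus the names from its own block's footer. The answer's own code, which works on the raw text of the sheet, follows.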
    import re

    import numpy as np
    import pandas as pd

    data = '''name    sex  age  ID  start   end  main    data    testtime    item    subitem result  unit    mark    reference   testman comfirmman
    LSF  female  60  12345678    2018-12-18 08:58    2018-12-29 08:30    knee    11                                  
    2018-12-28 13:59    metabolism II               comfirm 12345678                                        
    subitem         result  unit    mark    reference                                       
    Na          142 mmol/L      135 - 145                                       
    K           3.98    mmol/L      3.50 - 5.30                                     
    Cl          105 mmol/L      96 - 110                                        
    PHOS            1.25    mmol/L      0.97 - 1.62                                     
    testman:YYY             comfirmman:AAA                                              
    2018-12-28 9:57 routine blood               comfirm 12345678                                        
    subitem         result  unit    mark    reference                                       
    CRP         14.72   mg/L    ↑   0.00 - 10.00                                        
    WBC         6.73    x10^9/L     4.00 - 10.00                                        
    NEUT%           0.524           0.460 - 0.750                                       
    testman:BBB             comfirmman:EEE                                              '''
    # the first two rows (column names and values) are the master data
    h = [[t.strip() for t in re.split("  ", l) if t.strip() != ""] for l in data.split("\n")[:2]]
    # keep only as many column names as there are master values
    hf = pd.DataFrame(h[1:], columns=h[0][:len(h[1])])
    # insert ID into detail data
    d = [[hf.loc[0:,"ID"].values[0]]+[t.strip() for t in re.split("  ", l) if t.strip()!=""] for l in data.split("\n")[3:] ]
    d[0][0] = "ID" # modify column header
    df = pd.DataFrame(d[1:], columns=d[0])
    # find the rows that have testman and confirmman
    rows = df[df["subitem"].str.contains("testman")].index.values
    # update each row with testman and confirmman
    for i, r in enumerate(rows):
        rs = 0 if i==0 else rows[i-1]+1
        df.loc[rs:r-1, "testman"] = df.loc[r:r,"subitem"].values[0].replace("testman:", "")
        df.loc[rs:r-1, "confirmman"] = df.loc[r:r,"result"].values[0].replace("comfirmman:", "")
    
    df.loc[df["unit"].isna(),"testman"] = np.nan  # a bit more cleanup
    # join it all together, excluding detail rows that are not test results
    result = hf.merge(df[~df["testman"].isna()], on="ID")
    print(result)
    
    

    Output

    name    sex age ID  start   end main    data    subitem result  unit    mark    reference   testman confirmman
    0   LSF female  60  12345678    2018-12-18 08:58    2018-12-29 08:30    knee    11  Na  142 mmol/L  135 - 145   None    None    YYY AAA
    1   LSF female  60  12345678    2018-12-18 08:58    2018-12-29 08:30    knee    11  K   3.98    mmol/L  3.50 - 5.30 None    YYY AAA
    2   LSF female  60  12345678    2018-12-18 08:58    2018-12-29 08:30    knee    11  Cl  105 mmol/L  96 - 110    None    None    YYY AAA
    3   LSF female  60  12345678    2018-12-18 08:58    2018-12-29 08:30    knee    11  PHOS    1.25    mmol/L  0.97 - 1.62 None    YYY AAA
    4   LSF female  60  12345678    2018-12-18 08:58    2018-12-29 08:30    knee    11  subitem result  unit    mark    reference   BBB EEE
    5   LSF female  60  12345678    2018-12-18 08:58    2018-12-29 08:30    knee    11  CRP 14.72   mg/L    ↑   0.00 - 10.00    BBB EEE
    6   LSF female  60  12345678    2018-12-18 08:58    2018-12-29 08:30    knee    11  WBC 6.73    x10^9/L 4.00 - 10.00    None    BBB EEE
    7   LSF female  60  12345678    2018-12-18 08:58    2018-12-29 08:30    knee    11  NEUT%   0.524   0.460 - 0.750   None    None    BBB EEE
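The output above still lacks the `testtime` and `item` columns that the question asks for, keeps the repeated `subitem` header of the second block as a data row, and shifts values into the wrong column wherever a cell was blank. One possible way around this is to walk the sheet block by block. The sketch below is an assumption-laden illustration rather than part of the answer: `flatten_blocks` is a hypothetical helper, and it assumes the sheet is already available as one list of cell values per row with blank cells kept as empty strings (for instance after reading the sheet with `header=None`), which splitting on whitespace cannot guarantee.

```python
import pandas as pd

# Master values and block rows copied from the question; blank cells are "".
master = {"name": "LSF", "sex": "female", "age": "60", "ID": "12345678",
          "start": "2018-12-18 08:58", "end": "2018-12-29 08:30",
          "main": "knee", "data": "11"}
rows = [
    ["2018-12-28 13:59", "metabolism II", "comfirm", "12345678"],
    ["subitem", "result", "unit", "mark", "reference"],
    ["Na", "142", "mmol/L", "", "135 - 145"],
    ["K", "3.98", "mmol/L", "", "3.50 - 5.30"],
    ["Cl", "105", "mmol/L", "", "96 - 110"],
    ["PHOS", "1.25", "mmol/L", "", "0.97 - 1.62"],
    ["testman:YYY", "comfirmman:AAA"],
    ["2018-12-28 9:57", "routine blood", "comfirm", "12345678"],
    ["subitem", "result", "unit", "mark", "reference"],
    ["CRP", "14.72", "mg/L", "↑", "0.00 - 10.00"],
    ["WBC", "6.73", "x10^9/L", "", "4.00 - 10.00"],
    ["NEUT%", "0.524", "", "", "0.460 - 0.750"],
    ["testman:BBB", "comfirmman:EEE"],
]

def flatten_blocks(master, rows):
    """Hypothetical helper: one output row per test result."""
    records = []
    pending = []      # detail rows of the block currently being read
    header = None     # column names of the current block, None between blocks
    testtime = item = None
    for cells in rows:
        if cells[0].startswith("testman:"):
            # Footer row: attach both names to the block's rows and close it.
            footer = dict(c.split(":", 1) for c in cells)
            records += [{**rec, **footer} for rec in pending]
            pending, header = [], None
        elif cells[0] == "subitem":
            header = cells                       # block header row
        elif header is None:
            testtime, item = cells[0], cells[1]  # block title row
        else:
            detail = dict(zip(header, cells))    # one test result
            pending.append({**master, "testtime": testtime, "item": item, **detail})
    return pd.DataFrame(records)

wide = flatten_blocks(master, rows)
print(wide)
```

Because each row is classified by its position within a block (title, header, detail, footer), the header rows never reach the result and blank `unit` or `mark` cells stay in their own columns.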
    
    

    【Comments】:
