【Title】: pandas: When cell contents are lists/ NaN/ string, create a row for each element
【Posted】: 2018-07-19 04:24:22
【Question】:

Hello, I have a df like the one below:

index a  b  c  d
0     xx aa av NaN
1     pp as ka [1,2,3,4]
2     pa aj q  1234
3     xq aq aq NaN
4     pn an kn [10,20,30,40]
5     px ax kx "00012" 

I want to convert it into the following:

index a  b  c  d              d-separated
0     xx aa av NaN            NaN
1     pp as ka [1,2,3,4]      1
2     pp as ka [1,2,3,4]      2
3     pp as ka [1,2,3,4]      3
4     pp as ka [1,2,3,4]      4
5     pa aj q  1234           1234
6     xq aq aq NaN            NaN
7     pn an kn [10,20,30,40]  10
8     pn an kn [10,20,30,40]  20
9     pn an kn [10,20,30,40]  30
10    pn an kn [10,20,30,40]  40
11    px ax kx "00012"        "00012"

I have looked at:

pandas: When cell contents are lists, create a row for each element in the list

Split (explode) pandas dataframe string entry to separate rows

However, my case is different from theirs, so those solutions do not work on my example. Thank you for your help.

【Question comments】:

    Tags: python python-3.x pandas


    【Answer 1】:

    Setup

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'a': ['xx', 'pp', 'pa', 'xq', 'pn', 'px'], 'b': ['aa', 'as', 'aj', 'aq', 'an', 'ax'], 'c': ['av', 'ka', 'q', 'aq', 'kn', 'kx'], 'd': [np.nan, [1,2,3,4], 1234, np.nan, [10, 20, 30, 40], '00012']})
    

    This is a tricky problem, mainly because of the NaNs, so I first replace them with a fill value and then change them back at the end:

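    # Expand d into one column per list element, with -999 standing in for NaN,
    # then stack those columns into rows and restore the NaNs.
    (df.join(df.fillna(-999)
        .d.apply(pd.Series))
        .drop(columns='d').set_index(['a', 'b', 'c'])
        .stack().reset_index()
        .drop(columns='level_3')
        .replace(-999, np.nan).rename(columns={0: 'd-separated'})
    )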
    
         a   b   c d-separated
    0   xx  aa  av         NaN
    1   pp  as  ka           1
    2   pp  as  ka           2
    3   pp  as  ka           3
    4   pp  as  ka           4
    5   pa  aj   q        1234
    6   xq  aq  aq         NaN
    7   pn  an  kn          10
    8   pn  an  kn          20
    9   pn  an  kn          30
    10  pn  an  kn          40
    11  px  ax  kx       00012
    

    This does work, but it loses the original d column: because d contains an unhashable type (lists), it cannot be set as a level of the index.
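As a point of comparison, and not part of the original answer: pandas 0.25 and later provide `DataFrame.explode`, which postdates this question. A minimal sketch, assuming that version or newer is available, copies `d` into a new column and explodes only the copy, so the original `d` column survives without any placeholder for NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': ['xx', 'pp', 'pa', 'xq', 'pn', 'px'],
    'b': ['aa', 'as', 'aj', 'aq', 'an', 'ax'],
    'c': ['av', 'ka', 'q', 'aq', 'kn', 'kx'],
    'd': [np.nan, [1, 2, 3, 4], 1234, np.nan, [10, 20, 30, 40], '00012'],
})

# Copy d under the new name, then explode only the copy.
# Lists become one row per element; NaN, numbers and strings pass through.
out = (df.assign(**{'d-separated': df['d']})
         .explode('d-separated')
         .reset_index(drop=True))

print(out)
```

Note that `explode` turns an empty list into a single NaN row, which may or may not be what you want.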

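    【Comments】:

      【Answer 2】:

      This is possible, but not trivial. To use d as an index level, the lists must be converted to tuples so that they are hashable, and the scalars must be wrapped in one-element lists so that the DataFrame constructor can expand them: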

      df = pd.DataFrame({'a': ['xx', 'pp', 'pa', 'xq', 'pn', 'px'], 
                         'b': ['aa', 'as', 'aj', 'aq', 'an', 'ax'], 
                         'c': ['av', 'ka', 'q', 'aq', 'kn', 'kx'], 
                         'd': [np.nan, [1,2,3,4], '1234', np.nan, [10, 20, 30, 40], '00012']})
      
      
      s = (df.assign(d1=df['d'].fillna('NANval').apply(lambda x: x if isinstance(x, list) else [x]),
                     d = df['d'].apply(lambda x: tuple(x) if isinstance(x, list) else x))
             .set_index(['a','b','c','d'])['d1']
             )
      print (s)
      a   b   c   d               
      xx  aa  av  NaN                         [NANval]
      pp  as  ka  (1, 2, 3, 4)            [1, 2, 3, 4]
      pa  aj  q   1234                          [1234]
      xq  aq  aq  NaN                         [NANval]
      pn  an  kn  (10, 20, 30, 40)    [10, 20, 30, 40]
      px  ax  kx  00012                        [00012]
      Name: d1, dtype: object
      

      df = (pd.DataFrame(s.values.tolist(), index=s.index)
              .stack()
              .reset_index(4, drop=True)
              .reset_index(name='d-separated')
              .replace('NANval', np.nan)
              )
      

      If necessary, convert the tuples back to lists at the end:

      df['d'] = df['d'].apply(lambda x: list(x) if isinstance(x, tuple) else x)
      print (df)
      
           a   b   c                 d d-separated
      0   xx  aa  av               NaN         NaN
      1   pp  as  ka      [1, 2, 3, 4]           1
      2   pp  as  ka      [1, 2, 3, 4]           2
      3   pp  as  ka      [1, 2, 3, 4]           3
      4   pp  as  ka      [1, 2, 3, 4]           4
      5   pa  aj   q              1234        1234
      6   xq  aq  aq               NaN         NaN
      7   pn  an  kn  [10, 20, 30, 40]          10
      8   pn  an  kn  [10, 20, 30, 40]          20
      9   pn  an  kn  [10, 20, 30, 40]          30
      10  pn  an  kn  [10, 20, 30, 40]          40
      11  px  ax  kx             00012       00012
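The reason for the list-to-tuple round trip can be checked in isolation. The sketch below is only an illustration on a small made-up Series (not the full pipeline above): a list cannot be hashed, its tuple counterpart can, and converting back restores the list.

```python
import numpy as np
import pandas as pd

d = pd.Series([np.nan, [1, 2, 3, 4], '1234'], dtype=object)

# A list is mutable, so Python refuses to hash it.
try:
    hash(d.iloc[1])
    list_hashable = True
except TypeError:
    list_hashable = False

# Convert only the lists; NaN and strings are left untouched.
as_keys = d.apply(lambda x: tuple(x) if isinstance(x, list) else x)

# The reverse step, as used at the end of the answer.
restored = as_keys.apply(lambda x: list(x) if isinstance(x, tuple) else x)

print(list_hashable)    # False
print(as_keys.iloc[1])  # (1, 2, 3, 4)
print(restored.iloc[1]) # [1, 2, 3, 4]
```

One caveat of the reverse step: any cell that held a tuple in the original data would also come back as a list.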
      

      【Comments】:

        【Answer 3】:

        First expand the DataFrame to the required size, repeating each row as many times as needed:

        df1 = df.loc[df.index.repeat([len(x) if isinstance(x,list) else 1 for x in df.d])]
        

        Now flatten column d and concatenate it with df1 from above:

        d_sep= pd.DataFrame({'d_Sep':sum([x if isinstance(x,list) else [x] for x in df.d],[])})
        
        df2 = pd.concat([df1.reset_index(drop=True),d_sep],axis=1)
        
           a   b   c                 d  d_Sep
        0   xx  aa  av               NaN    NaN
        1   pp  as  ka      [1, 2, 3, 4]      1
        2   pp  as  ka      [1, 2, 3, 4]      2
        3   pp  as  ka      [1, 2, 3, 4]      3
        4   pp  as  ka      [1, 2, 3, 4]      4
        5   pa  aj   q              1234   1234
        6   xq  aq  aq               NaN    NaN
        7   pn  an  kn  [10, 20, 30, 40]     10
        8   pn  an  kn  [10, 20, 30, 40]     20
        9   pn  an  kn  [10, 20, 30, 40]     30
        10  pn  an  kn  [10, 20, 30, 40]     40
        11  px  ax  kx             00012  00012
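The two steps above can be folded into one function. This is a sketch under some assumptions rather than the answer's own code: `explode_column` is a hypothetical name, rows are repeated by position with `np.repeat` and `iloc` (so a non-unique index does not matter), and `itertools.chain.from_iterable` replaces `sum(..., [])`, which re-copies the accumulated list at every step and gets slow on large frames.

```python
from itertools import chain

import numpy as np
import pandas as pd


def explode_column(df, col, new_col):
    """Hypothetical helper: one output row per list element in df[col]."""
    # Wrap every non-list cell (NaN, number, string) in a one-element list.
    cells = [x if isinstance(x, list) else [x] for x in df[col]]
    # Repeat each row position once per element of its cell.
    positions = np.repeat(np.arange(len(df)), [len(c) for c in cells])
    out = df.iloc[positions].reset_index(drop=True)
    # Flatten all cells in a single pass and attach them as the new column.
    flat = list(chain.from_iterable(cells))
    out[new_col] = pd.Series(flat, dtype=object, index=out.index)
    return out


df = pd.DataFrame({
    'a': ['xx', 'pp', 'pa', 'xq', 'pn', 'px'],
    'b': ['aa', 'as', 'aj', 'aq', 'an', 'ax'],
    'c': ['av', 'ka', 'q', 'aq', 'kn', 'kx'],
    'd': [np.nan, [1, 2, 3, 4], 1234, np.nan, [10, 20, 30, 40], '00012'],
})

df2 = explode_column(df, 'd', 'd_Sep')
print(df2)
```

As with the original two-step version, a row whose cell is an empty list is repeated zero times and therefore disappears from the result.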
        

        【Comments】:

        • Sorry, but it raises ValueError: operands could not be broadcast together with shapes (768329,) (2,)