【Question title】: Python - Pandas dataframe with tuples
【Posted】: 2016-12-14 08:47:27
【Question】:

I have a dataframe like this:

      A       B       C       D
0   (a,b)   (c,d)   (e,f)   (g,h)
1   (a,b)   (c,d)   (e,f)    NaN
2   (a,b)    NaN    (e,f)    NaN
3   (a,b)    NaN     NaN     NaN

Each cell holds a tuple, and I want to turn the frame into this:

  |    A     |     B     |     C     |     D
0 |  a |  b  |  c  |  d  |  e  |  f  |  g  |  h
1 |  a |  b  |  c  |  d  |  e  |  f  | NaN | NaN
2 |  a |  b  | NaN | NaN |  e  |  f  | NaN | NaN
3 |  a |  b  | NaN | NaN | NaN | NaN | NaN | NaN

For example, column A should contain two sub-columns.

Thanks.

【Comments】:

  • Why not create two columns for each letter instead (e.g. A1, A2)?

Tags: python pandas dataframe tuples


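【Solution 1】:

You can use stack together with DataFrame.from_records, then unstack to move the original column labels back into the columns, swaplevel to swap the two levels of the column MultiIndex, and finally sort_index to sort the columns: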

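import numpy as np
import pandas as pd

# long format: one entry per (row, column) pair; the answer relies on stack() dropping the NaN cells
stacked = df.stack()
# the method chain is wrapped in parentheses so it can span several lines
df1 = (pd.DataFrame.from_records(stacked.tolist(), index=stacked.index)
         .unstack(1)
         .swaplevel(0, 1, axis=1)
         .sort_index(axis=1)
         .replace({None: np.nan}))
print(df1)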

   A       B         C         D     
   0  1    0    1    0    1    0    1
0  a  b    c    d    e    f    g    h
1  a  b    c    d    e    f  NaN  NaN
2  a  b  NaN  NaN    e    f  NaN  NaN
3  a  b  NaN  NaN  NaN  NaN  NaN  NaN
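
To make the two-level columns in this output easier to work with, here is a minimal sketch of how they can be addressed. It rebuilds the result from sample data shaped like the question's frame. The names wide, a_block, d_second and first_items, and the explicit dropna() call, are illustrative choices and not part of the answer.

```python
import numpy as np
import pandas as pd

# Sample frame shaped like the one in the question.
df = pd.DataFrame({
    "A": [("a", "b")] * 4,
    "B": [("c", "d"), ("c", "d"), np.nan, np.nan],
    "C": [("e", "f"), ("e", "f"), ("e", "f"), np.nan],
    "D": [("g", "h"), np.nan, np.nan, np.nan],
})

# Long format with only the tuple cells; dropna() makes the removal explicit.
stacked = df.stack().dropna()
# Expand each tuple into columns 0 and 1, keeping the (row, column) index.
expanded = pd.DataFrame(stacked.tolist(), index=stacked.index)
# Original column labels go back to the columns, as the outer level.
wide = expanded.unstack(1).swaplevel(0, 1, axis=1).sort_index(axis=1)

a_block = wide["A"]                         # both sub-columns of A (labelled 0 and 1)
d_second = wide[("D", 1)]                   # a single sub-column, as a Series
first_items = wide.xs(0, axis=1, level=1)   # first tuple element of every column
print(first_items)
```

Selecting with a single outer label returns a small frame, a tuple key returns one Series, and xs with level=1 takes a cross-section across all outer labels.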

Finally, you can remove the MultiIndex from the columns and build new flat column names:

stacked = df.stack()
df1 = (pd.DataFrame.from_records(stacked.tolist(), index=stacked.index)
         .unstack(1)
         .swaplevel(0, 1, axis=1)
         .sort_index(axis=1)
         .replace({None: np.nan}))
# join the two levels of each column label, e.g. ('A', 0) -> 'A0'
df1.columns = ['{}{}'.format(col[0], col[1]) for col in df1.columns]
print(df1)
  A0 A1   B0   B1   C0   C1   D0   D1
0  a  b    c    d    e    f    g    h
1  a  b    c    d    e    f  NaN  NaN
2  a  b  NaN  NaN    e    f  NaN  NaN
3  a  b  NaN  NaN  NaN  NaN  NaN  NaN

Timings

#len (df)=400

In [220]: %timeit (pir(df))
100 loops, best of 3: 3.45 ms per loop

In [221]: %timeit (jez(df))
100 loops, best of 3: 5.17 ms per loop

In [222]: %timeit (nick(df))
1 loop, best of 3: 231 ms per loop

In [223]: %timeit (df.stack().apply(pd.Series).unstack().swaplevel(0, 1, 1).sort_index(1).replace({None:np.nan}))
10 loops, best of 3: 152 ms per loop


#len (df)=4k

In [216]: %timeit (pir(df))
100 loops, best of 3: 16.5 ms per loop

In [217]: %timeit (jez(df))
100 loops, best of 3: 14.8 ms per loop

In [218]: %timeit (nick(df))
1 loop, best of 3: 2.34 s per loop

In [219]: %timeit (df.stack().apply(pd.Series).unstack().swaplevel(0, 1, 1).sort_index(1).replace({None:np.nan}))
1 loop, best of 3: 1.53 s per loop

Timing code

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [('a','b'),('a','b'),('a','b'),('a','b')],
                   'B': [('c','d'),('c','d'), np.nan,np.nan], 
                   'C':[('e','f'),('e','f'),('e','f'),np.nan],
                   'D':[('g','h'),np.nan,np.nan,np.nan]})

df = pd.concat([df]*1000).reset_index(drop=True)
print (df)

def jez(df):
    stacked = df.stack()
    return (pd.DataFrame.from_records(stacked.tolist(), index=stacked.index)
              .unstack(1)
              .swaplevel(0, 1, axis=1)
              .sort_index(axis=1)
              .replace({None: np.nan}))


print(df.stack().apply(pd.Series).unstack().swaplevel(0, 1, axis=1).sort_index(axis=1).replace({None: np.nan}))

def nick(df):
    cols = df.columns.values.tolist()
    return pd.concat([df[col].apply(pd.Series) for col in cols], axis=1, keys=cols)

def pir(df):
    # fillna with (np.nan, np.nan)
    df_ = df.stack().unstack(fill_value=tuple([np.nan] * 2))
    # construct MultiIndex
    col = pd.MultiIndex.from_product([df.columns, [0, 1]])
    # same idea as Nickil's pd.concat, but using numpy (items() replaces iteritems(), removed in pandas 2.0)
    return pd.DataFrame(np.hstack([np.array(s.values.tolist()) for _, s in df_.items()]), columns=col)


print (jez(df))
print (nick(df))
print (pir(df))
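
The %timeit lines in the timings are IPython magics. Outside IPython, a comparable measurement can be taken with the standard-library timeit module. The sketch below times a single function, split_tuples, which is a hypothetical name for the stack / expand / unstack approach. The frame size and repeat counts are arbitrary assumptions, and absolute numbers will vary with the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({
    "A": [("a", "b")] * 4,
    "B": [("c", "d"), ("c", "d"), np.nan, np.nan],
    "C": [("e", "f"), ("e", "f"), ("e", "f"), np.nan],
    "D": [("g", "h"), np.nan, np.nan, np.nan],
})
df = pd.concat([base] * 100, ignore_index=True)  # 400 rows


def split_tuples(frame):
    """Hypothetical helper: stack, expand the tuples, then unstack."""
    stacked = frame.stack().dropna()
    expanded = pd.DataFrame(stacked.tolist(), index=stacked.index)
    return expanded.unstack(1).swaplevel(0, 1, axis=1).sort_index(axis=1)


number = 10  # calls per measurement
runs = timeit.repeat(lambda: split_tuples(df), number=number, repeat=3)
best_ms = min(runs) / number * 1000
print(f"{number} loops, best of 3: {best_ms:.2f} ms per loop")
```

Taking the minimum of the repeats mirrors the "best of 3" figure that %timeit reports in the session above.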

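【Comments】:

    【Solution 2】:

    Use apply to split the tuples in each Series into separate elements. Then concatenate the resulting frames column-wise, passing the keys argument so that the outer column level carries the same headers as the original DF.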

    cols = df.columns.values.tolist()
    pd.concat([df[col].apply(pd.Series) for col in cols], axis=1, keys=cols)
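
As a small self-contained illustration of what the keys argument contributes here, the sketch below runs the same two steps on a two-column frame. The sample data and the names parts and out are assumptions made for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [("a", "b"), ("a", "b")],
    "B": [("c", "d"), np.nan],
})

# Each column becomes a small frame with sub-columns 0 and 1;
# a NaN cell turns into a row of NaNs.
parts = [df[col].apply(pd.Series) for col in df.columns]

# keys= adds an outer column level, one label per concatenated piece.
out = pd.concat(parts, axis=1, keys=list(df.columns))
print(out)
```

Note that the number of sub-columns is inferred per column from the tuples it contains, so a column holding no tuples at all would probably come out with a single sub-column rather than two.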
    

    【Comments】:

      【Solution 3】:

      Method 1
      stack + apply

      df.stack().apply(pd.Series).unstack().swaplevel(0, 1, axis=1).sort_index(axis=1)
      

      Method 2

      # fillna with (np.nan, np.nan)
      df_ = df.stack().unstack(fill_value=tuple([np.nan] * 2))
      # construct MultiIndex
      col = pd.MultiIndex.from_product([df.columns, [0, 1]])
      # same idea as Nickil's pd.concat, but using numpy
      # (items() replaces iteritems(), which was removed in pandas 2.0)
      pd.DataFrame(
          np.hstack(
              [np.array(s.values.tolist())
               for _, s in df_.items()]
          ), columns=col)
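
A variation on the NumPy idea in Method 2, sketched under the assumption that every tuple has exactly two elements. Instead of relying on unstack(fill_value=...), it pads the missing cells in plain Python and reshapes one 3-D array. The names width, filler, cells and result are hypothetical, and this is not the answer's own code.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [("a", "b")] * 4,
    "B": [("c", "d"), ("c", "d"), np.nan, np.nan],
    "C": [("e", "f"), ("e", "f"), ("e", "f"), np.nan],
    "D": [("g", "h"), np.nan, np.nan, np.nan],
})

width = 2                    # assumed: every tuple has exactly two elements
filler = (np.nan,) * width   # stands in for a missing cell

# Give every cell the same length by swapping NaN cells for a tuple of NaNs.
cells = [
    [cell if isinstance(cell, tuple) else filler for cell in row]
    for row in df.itertuples(index=False, name=None)
]

# Shape (rows, columns, width); merge the last two axes side by side.
values = np.array(cells, dtype=object).reshape(len(df), len(df.columns) * width)

columns = pd.MultiIndex.from_product([df.columns, range(width)])
result = pd.DataFrame(values, index=df.index, columns=columns)
print(result)
```

Because the reshape walks the array in row-major order, the flattened values line up with the (column, position) pairs produced by MultiIndex.from_product.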
      

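      【Comments】:

      • Hmm, I don't think duplicating the columns in the second solution is a good idea. What do you think?
      • @jezrael I decided to change it.
      • I added timings. On the larger dataframe my code seems to be faster.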