Pivot Pandas Dataframe with Duplicates using Masking

【Question title】: Pivot Pandas Dataframe with Duplicates using Masking
【Posted】: 2017-05-01 20:50:25
【Question description】:

A non-indexed df contains one row per gene, the cell that carries a mutation in that gene, and the type of mutation in that gene:

import pandas as pd

df = pd.DataFrame({'gene': ['one','one','one','two','two','two','three'],
                   'cell': ['A', 'A', 'C', 'A', 'B', 'C','A'],
                   'mutation': ['frameshift', 'missense', 'nonsense', '3UTR', '3UTR', '3UTR', '3UTR']})

df:

  cell   gene    mutation
0    A    one  frameshift
1    A    one    missense
2    C    one    nonsense
3    A    two        3UTR
4    B    two        3UTR
5    C    two        3UTR
6    A  three        3UTR

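I want to pivot this df so that it is indexed by gene and has the cells as columns. The problem is that there can be multiple entries per cell: any one gene in a given cell can have more than one mutation (cell A has two different mutations in gene one). So when I run: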

df.pivot_table(index='gene', columns='cell', values='mutation')

this happens:

DataError: No numeric types to aggregate

I would like to use masking to perform the pivot while capturing the presence of at least one mutation:

       A  B  C
gene          
one    1  1  1
two    0  1  0
three  1  1  0

【Question discussion】:

Tags: python pandas group-by pivot-table reshape

【Solution 1】:

A solution with drop_duplicates and pivot_table:

# Parentheses are needed so the method chain can span several lines.
df = (df.drop_duplicates(['cell','gene'])
        .pivot_table(index='gene',
                     columns='cell',
                     values='mutation',
                     aggfunc=len,
                     fill_value=0))
print(df)
cell   A  B  C
gene          
one    1  0  1
three  1  0  0
two    1  1  1

Another solution, starting again from the original df, uses drop_duplicates, then groupby with the size aggregation, and finally reshapes with unstack:

# Parentheses are needed so the method chain can span several lines.
df = (df.drop_duplicates(['cell','gene'])
        .groupby(['cell', 'gene'])
        .size()
        .unstack(0, fill_value=0))
print(df)
cell   A  B  C
gene          
one    1  0  1
three  1  0  0
two    1  1  1
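
As a quick check, the two approaches above can be run side by side on the question's data. This is only a minimal sketch: the names `unique_pairs`, `via_pivot` and `via_groupby` are illustrative and do not appear in the answer, and the results are kept in separate variables so the original `df` is not overwritten.

```python
import pandas as pd

df = pd.DataFrame({
    'gene': ['one', 'one', 'one', 'two', 'two', 'two', 'three'],
    'cell': ['A', 'A', 'C', 'A', 'B', 'C', 'A'],
    'mutation': ['frameshift', 'missense', 'nonsense',
                 '3UTR', '3UTR', '3UTR', '3UTR'],
})

# Keep a single row per (cell, gene) pair so each pair is counted once.
unique_pairs = df.drop_duplicates(['cell', 'gene'])

# Approach 1: pivot_table with len as the aggregation.
via_pivot = unique_pairs.pivot_table(index='gene', columns='cell',
                                     values='mutation', aggfunc=len,
                                     fill_value=0)

# Approach 2: groupby + size, then move the 'cell' level into the columns.
via_groupby = (unique_pairs.groupby(['cell', 'gene'])
                           .size()
                           .unstack(0, fill_value=0))

print(via_pivot)
print(via_groupby)
```

Both tables should hold the same 0/1 values, with genes sorted alphabetically in the index.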

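【Discussion】:

【Solution 2】:

The error is not about duplicate entries in the index. pivot_table can handle multiple values for the same index entry; I don't believe that is true of the pivot method. You can, however, solve your problem by changing the aggregation to something that works on strings rather than numbers. Most aggregation functions operate on numeric columns, and the code you wrote above produces an error related to the column's data type, not an index error.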

df.pivot_table(index='gene',
               columns='cell',
               values='mutation',
               aggfunc='count', fill_value=0)

If you only want a value of 1 per cell, you can do a groupby, aggregate everything to 1, and then unstack one level.

df.groupby(['cell', 'gene']).agg(lambda x: 1).unstack(fill_value=0)
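
The following sketch puts the two ideas from this answer together on the question's data. Note that `aggfunc='count'` returns 2 for gene one in cell A, so the counts still have to be reduced to a presence flag. The `.gt(0).astype(int)` mask and the names `counts`, `presence` and `flags` are assumptions made for illustration, not part of the answer. The groupby variant here lists `gene` first and selects the `mutation` column, so that genes end up as the row index.

```python
import pandas as pd

df = pd.DataFrame({
    'gene': ['one', 'one', 'one', 'two', 'two', 'two', 'three'],
    'cell': ['A', 'A', 'C', 'A', 'B', 'C', 'A'],
    'mutation': ['frameshift', 'missense', 'nonsense',
                 '3UTR', '3UTR', '3UTR', '3UTR'],
})

# 'count' works on string columns, so no numeric data is required.
counts = df.pivot_table(index='gene', columns='cell',
                        values='mutation', aggfunc='count', fill_value=0)

# Mask the counts: 1 where at least one mutation was recorded, else 0.
presence = counts.gt(0).astype(int)

# groupby variant: every existing (gene, cell) group becomes 1,
# and missing combinations are filled with 0 when unstacking 'cell'.
flags = (df.groupby(['gene', 'cell'])['mutation']
           .agg(lambda s: 1)
           .unstack(fill_value=0))

print(counts)
print(presence)
print(flags)
```

With this data, `presence` and `flags` should contain the same 0/1 values, while `counts` differs only where a gene has more than one mutation in the same cell.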

【Discussion】: