Summing chunks of columns row-wise in a Pandas DataFrame - Answers

【Question title】: Summing chunks of columns - row wise - in Pandas dataframe
【Posted】: 2015-05-07 02:24:01
【Question】:

With the following code:

import pandas as pd
df = pd.DataFrame({'ProbeGenes' : ['1431492_at Lipn', '1448678_at Fam118a','1452580_a_at Mrpl21'],
                   '(5)foo.ID.LN.x1' : [20.3, 25.3,3.1],
                   '(5)foo.ID.LN.x2' : [130, 150,173],        
                   '(5)foo.ID.LN.x3' : [1.0, 2.0,12.0],         
                   '(3)bar.ID.LN.x1' : [1,2,3],
                   '(3)bar.ID.LN.x2' : [4,5,6],        
                   '(3)bar.ID.LN.x3' : [7,8,9]        
                   })


new_cols = df.pop("ProbeGenes").str.split().apply(pd.Series)
new_cols.columns = ["Probe","Gene"]
df = df.join(new_cols)
cols = df.columns.tolist()
cols = cols[-2:] + cols[:-2]
df = df[cols]
df

I can produce the following data frame:

          Probe     Gene  (3)bar.ID.LN.x1  (3)bar.ID.LN.x2  (3)bar.ID.LN.x3  \
0    1431492_at     Lipn                1                4                7
1    1448678_at  Fam118a                2                5                8
2  1452580_a_at   Mrpl21                3                6                9

   (5)foo.ID.LN.x1  (5)foo.ID.LN.x2  (5)foo.ID.LN.x3
0             20.3              130                1
1             25.3              150                2
2              3.1              173               12

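Note that the data frame contains two chunks (named foo and bar), each of which contains x1, x2, x3 in turn. What I want to do is sum the values within each chunk, row by row, resulting in this data frame: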

          Probe     Gene  foo   bar
     1431492_at     Lipn  151.3 12
     1448678_at  Fam118a  177.3 15
   1452580_a_at   Mrpl21  188.1 18

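The actual data can contain more than two chunk names. Each chunk will have 2 or 3 members (x1,x2 or x1,x2,x3).

The chunk name can be captured with the following regex: /\(\d+\)(\w+)\..*/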
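
For reference, a minimal sketch of applying that pattern from Python with the standard re module. The columns list is only a hand-copied sample of the column names above, and chunk_names is an illustrative variable name, not part of the original code.

```python
import re

# Sample of the column names from the data frame above (assumption: copied by hand)
columns = ["Probe", "Gene", "(5)foo.ID.LN.x1", "(5)foo.ID.LN.x2", "(3)bar.ID.LN.x1"]

# Same pattern as in the question, written as a Python raw string
pattern = re.compile(r"\(\d+\)(\w+)\..*")

# Keep group 1 (the chunk name) for every column that matches the pattern
chunk_names = sorted({m.group(1) for m in map(pattern.fullmatch, columns) if m})
print(chunk_names)
```

Columns such as Probe and Gene do not match the pattern, so they are skipped.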

How can I achieve this?

【Comments】:

Tags: python regex pandas

【Solution 1】:

If the data is small, one option is:

df['foo'] = df.filter(regex='foo').sum(axis=1) # selects every column whose name contains 'foo'
df['bar'] = df.filter(regex='bar').sum(axis=1)

Don't use this if your data has more than 10,000 rows. Summing with axis=1 is generally slow.
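
The two lines above hard-code the chunk names. A possible way to extend the same filter-based idea to any number of chunks is sketched below. It assumes the column naming scheme from the question, "(N)name.ID.LN.xK", and the names variable and the anchored pattern are illustrative choices rather than part of the original answer.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "Probe": ["1431492_at", "1448678_at", "1452580_a_at"],
    "Gene": ["Lipn", "Fam118a", "Mrpl21"],
    "(5)foo.ID.LN.x1": [20.3, 25.3, 3.1],
    "(5)foo.ID.LN.x2": [130, 150, 173],
    "(5)foo.ID.LN.x3": [1.0, 2.0, 12.0],
    "(3)bar.ID.LN.x1": [1, 2, 3],
    "(3)bar.ID.LN.x2": [4, 5, 6],
    "(3)bar.ID.LN.x3": [7, 8, 9],
})

# Discover the chunk names from the column labels
matches = (re.match(r"\(\d+\)(\w+)\.", col) for col in df.columns)
names = sorted({m.group(1) for m in matches if m})

for name in names:
    # Anchor the pattern so that 'foo' does not also select a chunk like 'foobar',
    # and so that the newly added summary columns are never picked up
    chunk_pattern = r"^\(\d+\)" + re.escape(name) + r"\."
    df[name] = df.filter(regex=chunk_pattern).sum(axis=1)

print(df[["Probe", "Gene"] + names])
```

This still sums with axis=1 once per chunk, so the caveat about large frames applies unchanged.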

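【Comments】:

How would you generalize this to more than 2 chunks?
What would you suggest if there are more than 10,000 rows?

【Solution 2】:

Here is one way to start finding such "chunks":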

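import re

# Collect the chunk names, e.g. 'foo' from '(5)foo.ID.LN.x1'
chunks = set(re.split(r'\(\d+\)', i)[1].split('.')[0] for i in df.columns if '.' in i)

for each_chunk in chunks:
    # Row-wise sum over the original (dotted) columns that belong to this chunk
    df[each_chunk] = df[[i for i in df.columns if '.' in i and each_chunk in i]].sum(axis=1)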

In [1298]: df.head()
Out[1298]: 
          Probe     Gene  (3)bar.ID.LN.x1  (3)bar.ID.LN.x2  (3)bar.ID.LN.x3  \
0    1431492_at     Lipn                1                4                7   
1    1448678_at  Fam118a                2                5                8   
2  1452580_a_at   Mrpl21                3                6                9   

   (5)foo.ID.LN.x1  (5)foo.ID.LN.x2  (5)foo.ID.LN.x3    foo  bar  
0             20.3              130                1  151.3   12  
1             25.3              150                2  177.3   15  
2              3.1              173               12  188.1   18

Benchmarks:

In [1266]: %timeit df[bar_cols].sum(axis=1)
1000 loops, best of 3: 476 µs per loop

In [1267]: %timeit df[[i for i in df.columns if 'bar' in i]].sum(axis=1)
1000 loops, best of 3: 483 µs per loop

In [1268]: %timeit df.filter(regex='foo').sum(axis=1)
1000 loops, best of 3: 483 µs per loop

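【Comments】:

Is there a way to generalize it to more than 2 chunks, instead of hard-coding foo_cols and bar_cols?
Something like this? df[[i for i in df.columns if 'search_word' in i]].sum(axis=1), looping over all the columns you need.
I'm not sure timings on this tiny DataFrame are worth much; a larger example would give more interesting results. :)
Completely agree, though. :)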
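
Following up on the point about larger examples, here is a rough sketch of how the comparison could be repeated on a bigger frame. The row count, the random data, and the helper names pandas_sum and numpy_sum are all assumptions made for illustration. No timings are quoted because they depend on the machine and the pandas version.

```python
import timeit
import numpy as np
import pandas as pd

n_rows = 100_000  # hypothetical size, adjust as needed
rng = np.random.default_rng(0)

# Same naming scheme as in the question, filled with random values
cols = [f"({k}){name}.ID.LN.x{i}"
        for k, name in ((5, "foo"), (3, "bar"))
        for i in (1, 2, 3)]
big = pd.DataFrame(rng.random((n_rows, len(cols))), columns=cols)

bar_cols = [c for c in big.columns if "bar" in c]

def pandas_sum():
    # Row-wise sum through pandas
    return big[bar_cols].sum(axis=1)

def numpy_sum():
    # Row-wise sum on the underlying NumPy array, wrapped back into a Series
    return pd.Series(big[bar_cols].to_numpy().sum(axis=1), index=big.index)

repeats = 20
for fn in (pandas_sum, numpy_sum):
    seconds = timeit.timeit(fn, number=repeats)
    print(f"{fn.__name__}: {seconds / repeats * 1e3:.3f} ms per call")
```

Both helpers return the same values, so whichever turns out faster on your data can be swapped in.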

【Solution 3】:

If you are doing this for many columns, I would suggest using a MultiIndex rather than dot-separated strings:

In [11]: new_cols = df.pop("ProbeGenes").str.split().apply(pd.Series)  # do something with this later

In [12]: df.columns = pd.MultiIndex.from_tuples(df.columns.map(lambda x: tuple(x.split("."))))

In [13]: df
Out[13]:
  (3)bar       (5)foo
      ID           ID
      LN           LN
      x1 x2 x3     x1   x2  x3
0      1  4  7   20.3  130   1
1      2  5  8   25.3  150   2
2      3  6  9    3.1  173  12

In [14]: df.loc[:, "(3)bar"].sum(axis=1)
Out[14]:
0    12
1    15
2    18
dtype: int64

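【Comments】:

The reason, which I should edit into the answer, is that this looks up "(3)bar" in the index, so you don't need to go through every column to check whether it contains bar. You could also clean up the MI, e.g. remove or move the 3.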
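
One possible way to do that cleanup and then sum every chunk at once is sketched below. The helper to_key and the result name sums are hypothetical, and grouping the transposed frame is a swapped-in technique: it is used here because grouping along axis=1 is deprecated in recent pandas releases.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "(5)foo.ID.LN.x1": [20.3, 25.3, 3.1],
    "(5)foo.ID.LN.x2": [130, 150, 173],
    "(5)foo.ID.LN.x3": [1.0, 2.0, 12.0],
    "(3)bar.ID.LN.x1": [1, 2, 3],
    "(3)bar.ID.LN.x2": [4, 5, 6],
    "(3)bar.ID.LN.x3": [7, 8, 9],
})

def to_key(col):
    # '(5)foo.ID.LN.x1' -> ('foo', 'ID', 'LN', 'x1'); the '(5)' prefix is dropped
    chunk, *rest = col.split(".")
    return (re.sub(r"^\(\d+\)", "", chunk), *rest)

df.columns = pd.MultiIndex.from_tuples([to_key(c) for c in df.columns])

# Group the transposed frame by the first index level (the chunk name),
# sum within each group, then transpose back to one row per probe
sums = df.T.groupby(level=0).sum().T
print(sums)
```

The result has one column per chunk, so it can be joined back onto the Probe and Gene columns.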