Pandas dataframe to nested counter dictionary: answers

【Question title】: Pandas dataframe to nested counter dictionary
【Posted】: 2019-03-26 23:25:15
【Question description】:

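I have seen many questions about converting a pandas dataframe into a nested dictionary, but none of them deal with aggregating the information. I may even be able to do what I need within pandas, but I am stuck.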

Input

I have a dataframe that looks like this:

  FeatureID    gene  Target  pos  bc_count
0     1_1_1  NRAS_3  TAGCAC    0      0.42
1     1_1_1  NRAS_3  TGCACA    1      1.00
2     1_1_1  NRAS_3  GCACAA    2      0.50
3     1_1_1  NRAS_3  CACAAA    3      2.00
4     1_1_1  NRAS_3  CAGAAA    3      0.42

# create the df shown above
import pandas as pd

df = pd.DataFrame([
    {"FeatureID": "1_1_1", "gene": "NRAS_3", "Target": "TAGCAC", "pos": 0, "bc_count": 0.42},
    {"FeatureID": "1_1_1", "gene": "NRAS_3", "Target": "TGCACA", "pos": 1, "bc_count": 1.00},
    {"FeatureID": "1_1_1", "gene": "NRAS_3", "Target": "GCACAA", "pos": 2, "bc_count": 0.50},
    {"FeatureID": "1_1_1", "gene": "NRAS_3", "Target": "CACAAA", "pos": 3, "bc_count": 2.00},
    {"FeatureID": "1_1_1", "gene": "NRAS_3", "Target": "CAGAAA", "pos": 3, "bc_count": 0.42},
])

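Problem

For each row, I need to split the Target column into (position, letter, count) tuples: the starting position is given in the "pos" column, each subsequent character of the string gets the next position, and the count is the value in that row's "bc_count" column.

For example, for the first row, the required list of tuples would be:

[(0, "T", 0.42), (1, "A", 0.42), (2, "G", 0.42), (3, "C", 0.42), (4, "A", 0.42), (5, "C", 0.42)]

What I've tried

I wrote code that breaks the Target column down by position, returning tuples of the position, the nucleotide (letter) and the count for that letter, and added the result to the dataframe as a column: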

def index_target(row):
    # one (position, letter, count) tuple per character of Target
    count_list = [(row.pos + x, y, row.bc_count)
                  for x, y in enumerate(row.Target)]
    return count_list

df['pos_count'] = df.apply(index_target, axis=1)

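This returns a list of tuples for each row, built from that row's Target column.

I need to take every row in the df and, for each target, sum the counts per position and letter. That is why I thought of using a dictionary as a counter: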

position[letter] += bc_count

I tried creating a defaultdict, but it appends each list of tuples separately instead of summing the counts for each position:

from collections import defaultdict

d = defaultdict(dict) # also tried defaultdict(list) here
for x,y,z in row.pos_count:
    d[x][y] += z
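
The `d[x][y] += z` attempt above fails because `defaultdict(dict)` creates an empty plain `dict` for each new position, and a plain `dict` has no default value for a missing letter, so the first `+=` raises `KeyError`. A minimal sketch of one way around this, assuming an inner `defaultdict(float)`; the `rows` list is a hypothetical stand-in holding the pos, Target and bc_count values from the table above:

```python
from collections import defaultdict

# Hypothetical stand-in for the dataframe rows: (pos, Target, bc_count)
rows = [
    (0, "TAGCAC", 0.42),
    (1, "TGCACA", 1.00),
    (2, "GCACAA", 0.50),
    (3, "CACAAA", 2.00),
    (3, "CAGAAA", 0.42),
]

# Outer key: position. Inner key: letter. Missing letters start at 0.0.
counts = defaultdict(lambda: defaultdict(float))

for pos, target, bc_count in rows:
    for offset, letter in enumerate(target):
        counts[pos + offset][letter] += bc_count

print(dict(counts[1]))            # {'A': 0.42, 'T': 1.0}
print(round(counts[3]["C"], 2))   # 4.34
```

The inner `defaultdict(float)` is what lets `+=` work on a letter that has not been seen yet at a given position.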

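Desired output

For each feature in the dataframe, the numbers below are the sums of the individual counts found in the bc_count column at each position. In the consensus, X marks a position where there is a tie, so no single letter can be returned as the maximum: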

pos A   T   G   C
0   25  80  25  57
1   32  19  100 32
2   27  18  16  27
3   90  90  90  90
4   10  42  37  18

Consensus = TGXXT
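
The tie rule can be checked against the table above with a small sketch. The `totals` dictionary simply copies the table, and `consensus_letter` is a hypothetical helper name, not something from the original post:

```python
# Totals copied from the desired-output table: position -> {letter: total}
totals = {
    0: {"A": 25, "T": 80, "G": 25, "C": 57},
    1: {"A": 32, "T": 19, "G": 100, "C": 32},
    2: {"A": 27, "T": 18, "G": 16, "C": 27},
    3: {"A": 90, "T": 90, "G": 90, "C": 90},
    4: {"A": 10, "T": 42, "G": 37, "C": 18},
}

def consensus_letter(letter_totals):
    """Return the letter with the highest total, or 'X' if the maximum is tied."""
    best = max(letter_totals.values())
    winners = [letter for letter, total in letter_totals.items() if total == best]
    return winners[0] if len(winners) == 1 else "X"

consensus = "".join(consensus_letter(totals[pos]) for pos in sorted(totals))
print(consensus)  # TGXXT
```

Positions 2 and 3 come out as X because the maximum (27 and 90 respectively) is shared by more than one letter.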

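【Question discussion】:

Sorry, a lot of dependencies are missing, and it is not clear how you get from start to finish. Please try to clarify your question.
I have given all the code needed to reproduce the problem, including dependencies, and clearly outlined the output I want. I hope that makes it clearer.
@SummerEla Does the "desired output" you show match your example input? I mean, is that the output you would get from this dataframe?

Tags: python pandas dataframe counter defaultdict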

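【Solution 1】:

I'm not sure how to get your desired output, but I created a list d containing the tuples you need from the dataframe. Hopefully it gives you some direction for what you want to build: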

d = []

# one (position, count, letter) tuple per character;
# position = the row's pos plus the character's offset within Target
for t, c, p in zip(df.Target, df.bc_count, df.pos):
    d.extend((p + i, c, ch) for i, ch in enumerate(t))

df_new = pd.DataFrame(d, columns=['pos', 'count', 'val'])
df_new = df_new.groupby(['pos', 'val']).agg({'count': 'sum'}).reset_index()

df_new.pivot(index='pos', columns='val', values='count')

【Discussion】:

【Solution 2】:

This may not be the most elegant solution, but I think it does what you need:

new_df = pd.DataFrame(
    df.apply(
        # this lambda is basically the same thing you're doing,
        # but we create a pd.Series with it
        lambda row: pd.Series(
            [(row.pos + i, c, row.bc_count) for i, c in enumerate(row.Target)]
        ),
        axis=1,
    )
    .stack()
    .tolist(),
    columns=["pos", "nucl", "count"],
)

new_df looks like this (first 10 rows):

  pos nucl count
0   0    T  0.42
1   1    A  0.42
2   2    G  0.42
3   3    C  0.42
4   4    A  0.42
5   5    C  0.42
6   1    T  1.00
7   2    G  1.00
8   3    C  1.00
9   4    A  1.00

Then I would pivot this to get the aggregated counts:

nucleotide_count_by_pos = new_df.pivot_table(
    index="pos",
    columns="nucl",
    values="count",
    aggfunc="sum",
    fill_value=0
)

nucleotide_count_by_pos looks like this (first 5 rows):

nucl     A     C     G     T
 pos
   0  0.00  0.00  0.00  0.42
   1  0.42  0.00  0.00  1.00
   2  0.00  0.00  1.92  0.00
   3  0.00  4.34  0.00  0.00
   4  4.34  0.00  0.00  0.00

Then get the consensus:

def get_consensus(row):
    max_value = row.max()
    nuc = row.idxmax()
    # return the letter only if exactly one nucleotide reaches the maximum
    if (row == max_value).sum() == 1:
        return nuc
    else:
        return "X"

consensus = ''.join(nucleotide_count_by_pos.apply(get_consensus, axis=1).tolist())

For your example data, this gives:

'TTGCACAAA'
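
Since the question title asks for a nested dictionary, the pivoted counts can also be turned into one with `DataFrame.to_dict(orient="index")`. This is a minimal, self-contained sketch rather than part of the original answer: it rebuilds the example data with the pos values from the question's table, and the names `records`, `long_df`, `table` and `nested` are chosen here for illustration only.

```python
import pandas as pd

# Example data, using the pos values shown in the question's table
df = pd.DataFrame({
    "Target": ["TAGCAC", "TGCACA", "GCACAA", "CACAAA", "CAGAAA"],
    "pos": [0, 1, 2, 3, 3],
    "bc_count": [0.42, 1.00, 0.50, 2.00, 0.42],
})

# One (position, nucleotide, count) record per character of each Target
records = [
    (row.pos + i, c, row.bc_count)
    for row in df.itertuples(index=False)
    for i, c in enumerate(row.Target)
]
long_df = pd.DataFrame(records, columns=["pos", "nucl", "count"])

# Sum the counts per position and nucleotide
table = long_df.pivot_table(
    index="pos", columns="nucl", values="count", aggfunc="sum", fill_value=0
)

# Nested dictionary: {position: {nucleotide: summed count}}
nested = table.to_dict(orient="index")
print(round(nested[3]["C"], 2))  # 4.34
```

With `orient="index"`, the outer keys are the positions and each inner dictionary maps a nucleotide to its summed count, including zeros for letters that never occur at that position.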

【Discussion】:

This is great. Thank you so much, I really appreciate it!